public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/114932] New: Improvement in CHREC can give large performance gains
@ 2024-05-03 5:51 tnfchris at gcc dot gnu.org
2024-05-03 6:26 ` [Bug tree-optimization/114932] " rguenth at gcc dot gnu.org
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-05-03 5:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
Bug ID: 114932
Summary: Improvement in CHREC can give large performance gains
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Target Milestone: ---
With the original fix for PR114074 applied (e.g.
g:a0b1798042d033fd2cc2c806afbb77875dd2909b) we saw not only regressions but
also big improvements.
The following testcase:
---
module brute_force
integer, parameter :: r=9
integer block(r, r, r)
contains
subroutine brute
k = 1
call digits_2(k)
end
recursive subroutine digits_2(row)
integer, intent(in) :: row
logical OK
do i1 = 0, 1
do i2 = 1, 1
do i3 = 1, 1
do i4 = 0, 1
do i5 = 1, select
do i6 = 0, 1
do i7 = l0, u0
select case(1 )
case(1)
block(:2, 7:, i7) = block(:2, 7:, i7) - 1
end select
do i8 = 1, 1
do i9 = 1, 1
if(row == 5) then
elseif(OK)then
call digits_2(row + 1)
end if
end do
end do
block(:, 1, i7) = select
end do
end do
end do
end do
end do
block = 1
end do
block = 1
block = block0 + select
end do
end
end
---
compiled with: -mcpu=neoverse-v1 -Ofast -fomit-frame-pointer foo.f90
gets vectorized after sra and constprop. But the final addressing modes are so
complicated that IVopts generates a register-offset mode:
4c: 2f00041d mvni v29.2s, #0x0
50: fc666842 ldr d2, [x2, x6]
54: fc656841 ldr d1, [x2, x5]
58: fc646840 ldr d0, [x2, x4]
5c: 0ebd8442 add v2.2s, v2.2s, v29.2s
60: 0ebd8421 add v1.2s, v1.2s, v29.2s
64: 0ebd8400 add v0.2s, v0.2s, v29.2s
which is harder for prefetchers to follow. When the patch was applied, it was
able to correctly lower these to the immediate-offset loads that the scalar
code was using:
38: 2f00041d mvni v29.2s, #0x0
3c: fc594002 ldur d2, [x0, #-108]
40: fc5b8001 ldur d1, [x0, #-72]
44: fc5dc000 ldur d0, [x0, #-36]
48: 0ebd8442 add v2.2s, v2.2s, v29.2s
4c: 0ebd8421 add v1.2s, v1.2s, v29.2s
50: 0ebd8400 add v0.2s, v0.2s, v29.2s
and also removes all the additional instructions needed to keep x6, x5 and x4
up to date.
This gave 10%+ improvements on various workloads.
(ps I'm looking at the __brute_force_MOD_digits_2.constprop.3.isra.0
specialization).
I will try to reduce it more, but am filing this so we can keep track and
hopefully fix it.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: rguenth at gcc dot gnu.org @ 2024-05-03 6:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The change likely made SCEV/IVOPTs "stop" at more convenient places, but we can
only know once there's more detailed analysis.
* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: pinskia at gcc dot gnu.org @ 2024-05-03 7:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |pinskia at gcc dot gnu.org
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> which is harder for prefetchers to follow.
This seems like a limitation in the HW prefetcher rather than anything else.
Maybe the cost model for addressing modes should punish base+index if so. Many
HW prefetchers I know of are based on the final VA (or even PA) rather than
looking at the instruction to see if it increments or not ...
* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: tnfchris at gcc dot gnu.org @ 2024-05-03 8:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #2)
> > which is harder for prefetchers to follow.
>
> This seems like a limitation in the HW prefetcher rather than anything else.
> Maybe the cost model for addressing mode should punish base+index if so.
> Many HW prefetchers I know of are based on the final VA (or even PA) rather
> looking at the instruction to see if it increments or not ...
That was the first thing we tried, and even increasing the cost of
register_offset to something ridiculously high didn't change a thing.
IVopts thinks it needs to use it and generates:
_1150 = (voidD.26 *) _1148;
_1152 = (sizetype) l0_78(D);
_1154 = _1152 * 324;
_1156 = _1154 + 216;
# VUSE <.MEM_421>
vect__349.614_1418 = MEM <vector(2) integer(kind=4)D.9> [(integer(kind=4)D.9
*)_1150 + _1156 * 1 clique 2 base 0];
Hence the bug report to see what's going on.
* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: tnfchris at gcc dot gnu.org @ 2024-05-03 8:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
reduced more:
---
module brute_force
integer, parameter :: r=9
integer block(r, r, 0)
contains
subroutine brute
do
do
do
do
do
do
do i7 = l0, 1
select case(1 )
case(1)
block(:2, 7:, 1) = block(:2, 7:, i7) - 1
end select
do i8 = 1, 1
do i9 = 1, 1
if(1 == 1) then
call digits_20
end if
end do
end do
end do
end do
end do
end do
end do
end do
end do
end
end
---
I'll have to stop now till I'm back, but the main difference seems to be in:
good:
<Induction Vars>:
IV struct:
SSA_NAME: _1
Type: integer(kind=8)
Base: (integer(kind=8)) ((unsigned long) l0_19(D) * 81)
Step: 81
Biv: N
Overflowness wrto loop niter: Overflow
IV struct:
SSA_NAME: _20
Type: integer(kind=8)
Base: (integer(kind=8)) l0_19(D)
Step: 1
Biv: N
Overflowness wrto loop niter: No-overflow
IV struct:
SSA_NAME: i7_28
Type: integer(kind=4)
Base: l0_19(D) + 1
Step: 1
Biv: Y
Overflowness wrto loop niter: No-overflow
IV struct:
SSA_NAME: vectp.22_46
Type: integer(kind=4) *
Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
324) + 36)
Step: 324
Object: (void *) &block
Biv: N
Overflowness wrto loop niter: No-overflow
bad:
<Induction Vars>:
IV struct:
SSA_NAME: _1
Type: integer(kind=8)
Base: (integer(kind=8)) l0_19(D) * 81
Step: 81
Biv: N
Overflowness wrto loop niter: No-overflow
IV struct:
SSA_NAME: _20
Type: integer(kind=8)
Base: (integer(kind=8)) l0_19(D)
Step: 1
Biv: N
Overflowness wrto loop niter: No-overflow
IV struct:
SSA_NAME: i7_28
Type: integer(kind=4)
Base: l0_19(D) + 1
Step: 1
Biv: Y
Overflowness wrto loop niter: No-overflow
IV struct:
SSA_NAME: vectp.22_46
Type: integer(kind=4) *
Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) *
81) + 9) * 4
Step: 324
Object: (void *) &block
Biv: N
Overflowness wrto loop niter: No-overflow
* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: tnfchris at gcc dot gnu.org @ 2024-05-03 8:44 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
--- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Created attachment 58095
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58095&action=edit
exchange2.fppized-good.f90.187t.ivopts
* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: tnfchris at gcc dot gnu.org @ 2024-05-03 8:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
--- Comment #6 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Created attachment 58096
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58096&action=edit
exchange2.fppized-bad.f90.187t.ivopts
* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: rguenth at gcc dot gnu.org @ 2024-05-03 9:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Likely
Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
324) + 36)
vs.
Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
* 81) + 9) * 4
where we fail to optimize the outer multiply. It's
((unsigned)((signed)x * 81) + 9) * 4
and likely done by extract_muldiv for the case of (unsigned)x. The trick
would be to promote the inner multiply to unsigned to make the otherwise
profitable transform valid. But best not by enhancing extract_muldiv ...
* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: tnfchris at gcc dot gnu.org @ 2024-05-13 8:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
Tamar Christina <tnfchris at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Ever confirmed|0 |1
Status|UNCONFIRMED |ASSIGNED
Last reconfirmed| |2024-05-13
Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org
--- Comment #8 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #7)
> Likely
>
> Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
> 324) + 36)
>
> vs.
>
> Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> * 81) + 9) * 4
>
> where we fail to optimize the outer multiply. It's
>
> ((unsigned)((signed)x * 81) + 9) * 4
>
> and likely done by extract_muldiv for the case of (unsigned)x. The trick
> would be to promote the inner multiply to unsigned to make the otherwise
> profitable transform valid. But best not by enhancing extract_muldiv ...
Ah, thanks!
Mine then.