public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/114932] New: Improvement in CHREC can give large performance gains
From: tnfchris at gcc dot gnu.org @ 2024-05-03  5:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

            Bug ID: 114932
           Summary: Improvement in CHREC can give large performance gains
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---

With the original fix for PR114074 applied (e.g.
g:a0b1798042d033fd2cc2c806afbb77875dd2909b) we saw not only regressions but
also big improvements.

The following testcase:

---
  module brute_force
    integer, parameter :: r=9
     integer  block(r, r, r)
    contains
  subroutine brute
      k = 1
call digits_2(k)
  end
   recursive subroutine digits_2(row)
  integer, intent(in) :: row
  logical OK
     do i1 = 0, 1
      do i2 = 1, 1
          do i3 = 1, 1
           do i4 = 0, 1
                do i5 = 1, select
                     do i6 = 0, 1
                         do i7 = l0, u0
                       select case(1 )
                       case(1)
                           block(:2, 7:, i7) = block(:2, 7:, i7) - 1
                       end select
                            do i8 = 1, 1
                               do i9 = 1, 1
                            if(row == 5) then
                          elseif(OK)then
                                    call digits_2(row + 1)
                                end if
                                end do
                          end do
                       block(:, 1, i7) =   select
                    end do
                    end do
              end do
              end do
           end do
        block = 1
     end do
     block = 1
     block = block0 + select
  end do
 end
  end
---

compiled with: -mcpu=neoverse-v1 -Ofast -fomit-frame-pointer foo.f90

gets vectorized after sra and constprop, but the final addressing modes are so
complicated that IVopts generates a register-offset addressing mode:

  4c:   2f00041d        mvni    v29.2s, #0x0
  50:   fc666842        ldr     d2, [x2, x6]
  54:   fc656841        ldr     d1, [x2, x5]
  58:   fc646840        ldr     d0, [x2, x4]
  5c:   0ebd8442        add     v2.2s, v2.2s, v29.2s
  60:   0ebd8421        add     v1.2s, v1.2s, v29.2s
  64:   0ebd8400        add     v0.2s, v0.2s, v29.2s

which is harder for prefetchers to follow.  With the patch applied, these were
correctly lowered to the immediate-offset loads that the scalar code was
using:

  38:   2f00041d        mvni    v29.2s, #0x0
  34:   fc594002        ldur    d2, [x0, #-108]
  40:   fc5b8001        ldur    d1, [x0, #-72]
  44:   fc5dc000        ldur    d0, [x0, #-36]
  48:   0ebd8442        add     v2.2s, v2.2s, v29.2s
  4c:   0ebd8421        add     v1.2s, v1.2s, v29.2s
  50:   0ebd8400        add     v0.2s, v0.2s, v29.2s

and it also removes all the additional instructions needed to keep x6, x5 and
x4 up to date.

This gave 10%+ improvements on various workloads.

(PS: I'm looking at the __brute_force_MOD_digits_2.constprop.3.isra.0
specialization.)

I will try to reduce it further, but I'm filing this now so we can keep track
of it and hopefully fix it.

* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: rguenth at gcc dot gnu.org @ 2024-05-03  6:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The change likely made SCEV/IVOPTs "stop" at more convenient places, but we can
only know once there's a more detailed analysis.

* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: pinskia at gcc dot gnu.org @ 2024-05-03  7:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> which is harder for prefetchers to follow.

This seems like a limitation in the HW prefetcher rather than anything else.
If so, maybe the cost model for addressing modes should penalize base+index.
Many HW prefetchers I know of are based on the final VA (or even PA) rather
than looking at the instruction to see whether it increments or not ...

* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: tnfchris at gcc dot gnu.org @ 2024-05-03  8:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #2)
> > which is harder for prefetchers to follow.
> 
> This seems like a limitation in the HW prefetcher rather than anything else.
> If so, maybe the cost model for addressing modes should penalize base+index.
> Many HW prefetchers I know of are based on the final VA (or even PA) rather
> than looking at the instruction to see whether it increments or not ...

That was the first thing we tried, and even increasing the cost of
register_offset to something ridiculously high doesn't change a thing.

IVopts thinks it needs to use it and generates:

  _1150 = (voidD.26 *) _1148;
  _1152 = (sizetype) l0_78(D);
  _1154 = _1152 * 324;
  _1156 = _1154 + 216;
  # VUSE <.MEM_421>
  vect__349.614_1418 = MEM <vector(2) integer(kind=4)D.9> [(integer(kind=4)D.9 *)_1150 + _1156 * 1 clique 2 base 0];

Hence the bug report to see what's going on.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
  2024-05-03  5:51 [Bug tree-optimization/114932] New: Improvement in CHREC can give large performance gains tnfchris at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-05-03  8:09 ` tnfchris at gcc dot gnu.org
@ 2024-05-03  8:41 ` tnfchris at gcc dot gnu.org
  2024-05-03  8:44 ` tnfchris at gcc dot gnu.org
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-05-03  8:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Reduced more:

---
  module brute_force
    integer, parameter :: r=9
     integer  block(r, r, 0)
    contains
  subroutine brute
     do
      do
          do
           do
                do
                     do
                         do i7 = l0, 1
                       select case(1 )
                       case(1)
                           block(:2, 7:, 1) = block(:2, 7:, i7) - 1
                       end select
                            do i8 = 1, 1
                               do i9 = 1, 1
                            if(1 == 1) then
                                    call digits_20
                                end if
                                end do
                          end do
                    end do
                    end do
              end do
              end do
           end do
     end do
  end do
 end
  end
---

I'll have to stop now till I'm back, but the main difference seems to be in:

good:

<Induction Vars>:
IV struct:
  SSA_NAME:     _1
  Type: integer(kind=8)
  Base: (integer(kind=8)) ((unsigned long) l0_19(D) * 81)
  Step: 81
  Biv:  N
  Overflowness wrto loop niter: Overflow
IV struct:
  SSA_NAME:     _20
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D)
  Step: 1
  Biv:  N
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME:     i7_28
  Type: integer(kind=4)
  Base: l0_19(D) + 1
  Step: 1
  Biv:  Y
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME:     vectp.22_46
  Type: integer(kind=4) *
  Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)
  Step: 324
  Object:       (void *) &block
  Biv:  N
  Overflowness wrto loop niter: No-overflow

bad:

<Induction Vars>:
IV struct:
  SSA_NAME:     _1
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D) * 81
  Step: 81
  Biv:  N
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME:     _20
  Type: integer(kind=8)
  Base: (integer(kind=8)) l0_19(D)
  Step: 1
  Biv:  N
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME:     i7_28
  Type: integer(kind=4)
  Base: l0_19(D) + 1
  Step: 1
  Biv:  Y
  Overflowness wrto loop niter: No-overflow
IV struct:
  SSA_NAME:     vectp.22_46
  Type: integer(kind=4) *
  Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
  Step: 324
  Object:       (void *) &block
  Biv:  N
  Overflowness wrto loop niter: No-overflow

* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: tnfchris at gcc dot gnu.org @ 2024-05-03  8:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Created attachment 58095
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58095&action=edit
exchange2.fppized-good.f90.187t.ivopts

* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: tnfchris at gcc dot gnu.org @ 2024-05-03  8:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #6 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Created attachment 58096
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58096&action=edit
exchange2.fppized-bad.f90.187t.ivopts

* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: rguenth at gcc dot gnu.org @ 2024-05-03  9:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Likely

  Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)

vs.

  Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4

where we fail to optimize the outer multiply.  It's

 ((unsigned)((signed)x * 81) + 9) * 4

and likely done by extract_muldiv for the case of (unsigned)x.  The trick
would be to promote the inner multiply to unsigned to make the otherwise
profitable transform valid.  But best not by enhancing extract_muldiv ...
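
For illustration only (not from the report), a small self-contained check that
the two base expressions above compute the same byte offset, assuming l0_19(D)
is a 32-bit signed value so the widening multiply by 81 cannot overflow; that
is what makes rewriting the inner multiply in unsigned arithmetic and
distributing the outer * 4 safe:

---
// Standalone illustration (hypothetical driver, not GCC code): with a 32-bit
// signed x standing in for l0_19(D), the widened multiply x * 81 cannot
// overflow int64_t, so both forms agree modulo 2^64 for every x.
#include <cassert>
#include <cstdint>

int main ()
{
  const int32_t samples[] = { INT32_MIN, -1000, -1, 0, 1, 7, 1000, INT32_MAX };
  for (int32_t x : samples)
    {
      /* "bad" base:  ((sizetype) ((integer(kind=8)) x * 81) + 9) * 4  */
      uint64_t bad  = ((uint64_t) ((int64_t) x * 81) + 9) * 4;
      /* "good" base: (unsigned long) x * 324 + 36  */
      uint64_t good = (uint64_t) x * 324 + 36;
      assert (bad == good);
    }
  return 0;
}
---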

* [Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
From: tnfchris at gcc dot gnu.org @ 2024-05-13  8:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2024-05-13
           Assignee|unassigned at gcc dot gnu.org      |tnfchris at gcc dot gnu.org

--- Comment #8 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #7)
> Likely
> 
>   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) *
> 324) + 36)
> 
> vs.
> 
>   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D)
> * 81) + 9) * 4
> 
> where we fail to optimize the outer multiply.  It's
> 
>  ((unsigned)((signed)x * 81) + 9) * 4
> 
> and likely done by extract_muldiv for the case of (unsigned)x.  The trick
> would be to promote the inner multiply to unsigned to make the otherwise
> profitable transform valid.  But best not by enhancing extract_muldiv ...

Ah, thanks!

Mine then.

* [Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.
From: tnfchris at gcc dot gnu.org @ 2024-06-05  9:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #9 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
It's taken me a bit of time to track down all the reasons for the speedup with
the earlier patch.

This comes from two parts:

1. Signed IVs don't get simplified.  Due to possible UB with signed overflow,
gimple expressions don't get simplified when the type is signed.

However, for addressing modes this doesn't matter: even after simplifying the
constants, any potential overflow can still happen.  Secondly, most
architectures say you can never reach the full address-space range anyway.
Those that do (like the ones that offer baremetal variants, such as Arm and
AArch64) explicitly specify that overflow is defined as wrapping around.  That
means that IVs, for their use in IVopts, should be safe to simplify as if they
were unsigned.

I have a patch that during the creation of IV candidates folds them to unsigned
and then folds them back to their original signed types.  This maintains all
the original overflow analysis and the correct typing in gimple.

2. The second problem is that, because Fortran has no unsigned types, the
front-end generates a signed IV.  Some optimizations can convert these to
unsigned as they fold; e.g. extract_muldiv is one place where this is done.

This can make us end up with the same IV as both signed and unsigned, as is the
case here:

<Invariant Expressions>:
inv_expr 1:     stride.3_27 * 4
inv_expr 2:     (unsigned long) stride.3_27 * 4

These end up being used in the same group:

Group 1:
  cand  cost    compl.  inv.expr.       inv.vars
  1     0       0       NIL;    6
  2     0       0       NIL;    6
  3     0       0       NIL;    6
  4     0       0       NIL;    6

which ends up with IVopts picking both the signed and unsigned IVs:

Improved to:
  cost: 24 (complexity 3)
  reg_cost: 9
  cand_cost: 15
  cand_group_cost: 0 (complexity 3)
  candidates: 1, 6, 8
   group:0 --> iv_cand:6, cost=(0,1)
   group:1 --> iv_cand:1, cost=(0,0)
   group:2 --> iv_cand:8, cost=(0,1)
   group:3 --> iv_cand:8, cost=(0,1)
  invariant variables: 6
  invariant expressions: 1, 2

and so generates the same IV as both signed and unsigned:

;;   basic block 21, loop depth 3, count 214748368 (estimated locally, freq 58.2545), maybe hot
;;    prev block 28, next block 31, flags: (NEW, REACHABLE, VISITED)
;;    pred:       28 [always]  count:23622320 (estimated locally, freq 6.4080) (FALLTHRU,EXECUTABLE)
;;                25 [always]  count:191126046 (estimated locally, freq 51.8465) (FALLTHRU,DFS_BACK,EXECUTABLE)
  # .MEM_66 = PHI <.MEM_34(28), .MEM_22(25)>
  # ivtmp.22_41 = PHI <0(28), ivtmp.22_82(25)>
  # ivtmp.26_51 = PHI <ivtmp.26_55(28), ivtmp.26_72(25)>
  # ivtmp.28_90 = PHI <ivtmp.28_99(28), ivtmp.28_98(25)>

...

;;   basic block 24, loop depth 3, count 214748366 (estimated locally, freq 58.2545), maybe hot
;;    prev block 22, next block 25, flags: (NEW, REACHABLE, VISITED)
;;    pred:       22 [always]  count:95443719 (estimated locally, freq 25.8909) (FALLTHRU)
;;                21 [33.3% (guessed)]  count:71582790 (estimated locally, freq 19.4182) (TRUE_VALUE,EXECUTABLE)
;;                31 [33.3% (guessed)]  count:47721860 (estimated locally, freq 12.9455) (TRUE_VALUE,EXECUTABLE)
  # .MEM_22 = PHI <.MEM_44(22), .MEM_31(21), .MEM_79(31)>
  ivtmp.22_82 = ivtmp.22_41 + 1;
  ivtmp.26_72 = ivtmp.26_51 + _80;
  ivtmp.28_98 = ivtmp.28_90 + _39;

These two IVs are always used as unsigned, so IVopts generates:

  _73 = stride.3_27 * 4;
  _80 = (unsigned long) _73;
  _54 = (unsigned long) stride.3_27;
  _39 = _54 * 4;

This means that in e.g. exchange2 we generate a lot of duplicate code.  I'm
unsure yet how to fix this; I think I need to know how the IV values are used.

Given that the signed IV is used as unsigned, they should be the same.

* [Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.
From: rguenth at gcc dot gnu.org @ 2024-06-05 10:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think the question is why IVOPTs ends up using both the signed and unsigned
variant of the same IV instead of expressing all uses of both with one IV?

That's what I'd look into.

* [Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.
From: tnfchris at gcc dot gnu.org @ 2024-06-05 19:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #11 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #10)
> I think the question is why IVOPTs ends up using both the signed and
> unsigned variant of the same IV instead of expressing all uses of both with
> one IV?
> 
> That's where I'd look into.

It looks like this is because of a subtle difference in the expressions.

In get_loop_invariant_expr IVOPTs first tries to strip away all casts with
STRIP_NOPS.

The first expression is (unsigned long) (stride.3_27 * 4) and the second
expression is ((unsigned long) stride.3_27) * 4 (the pretty-printing here is
pretty bad...).

So the first one becomes:
  (unsigned long) (stride.3_27 * 4) -> stride.3_27 * 4

and second one:
  ((unsigned long) stride.3_27) * 4 -> ((unsigned long) stride.3_27) * 4

since we don't care about overflow here, it looks like the stripping should
be recursive as long as it's a NOP expression between two integral types.

That would get them to hash to the same IV expression.  Trying now..
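
For illustration, a minimal sketch of the idea (not the actual change): compare
two expressions while recursively looking through NOP conversions between
integral types, so that (unsigned long) (a * 4) and ((unsigned long) a) * 4
compare equal (the hashing would need the same treatment).  The helper name is
made up; only tree.h/fold-const.h macros and functions (CONVERT_EXPR_P,
TREE_OPERAND, INTEGRAL_TYPE_P, operand_equal_p) are assumed:

---
/* Sketch only: structural equality that ignores integral nop conversions.
   Precision/overflow differences are deliberately ignored, which is only
   acceptable in contexts like this one where overflow does not matter.  */
static bool
operand_equal_ignoring_int_nops (tree a, tree b)
{
  /* Look through integral nop conversions on either side.  */
  while (CONVERT_EXPR_P (a)
         && INTEGRAL_TYPE_P (TREE_TYPE (a))
         && INTEGRAL_TYPE_P (TREE_TYPE (TREE_OPERAND (a, 0))))
    a = TREE_OPERAND (a, 0);
  while (CONVERT_EXPR_P (b)
         && INTEGRAL_TYPE_P (TREE_TYPE (b))
         && INTEGRAL_TYPE_P (TREE_TYPE (TREE_OPERAND (b, 0))))
    b = TREE_OPERAND (b, 0);

  if (TREE_CODE (a) != TREE_CODE (b))
    return false;

  switch (TREE_CODE (a))
    {
    case PLUS_EXPR:
    case MINUS_EXPR:
    case MULT_EXPR:
      return (operand_equal_ignoring_int_nops (TREE_OPERAND (a, 0),
                                               TREE_OPERAND (b, 0))
              && operand_equal_ignoring_int_nops (TREE_OPERAND (a, 1),
                                                  TREE_OPERAND (b, 1)));
    default:
      /* Anything else: fall back to the usual structural equality.  */
      return operand_equal_p (a, b, 0);
    }
}
---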

* [Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.
From: rguenther at suse dot de @ 2024-06-06  6:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 5 Jun 2024, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
> 
> --- Comment #11 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #10)
> > I think the question is why IVOPTs ends up using both the signed and
> > unsigned variant of the same IV instead of expressing all uses of both with
> > one IV?
> > 
> > That's where I'd look into.
> 
> It looks like this is because of a subtle difference in the expressions.
> 
> In get_loop_invariant_expr IVOPTs first tries to strip away all casts with
> STRIP_NOPS.
> 
> The first expression is (unsigned long) (stride.3_27 * 4) and the second
> expression is ((unsigned long) stride.3_27) * 4 (The pretty printing here is
> pretty bad...)
> 
> So the first one becomes:
>   (unsigned long) (stride.3_27 * 4) -> stride.3_27 * 4
> 
> and second one:
>   ((unsigned long) stride.3_27) * 4 -> ((unsigned long) stride.3_27) * 4
> 
> since we don't care about overflow here, it looks like the stripping should
> be recursive as long as it's a NOP expression between two integral types.
> 
> That would get them to hash to the same IV expression.  Trying now..

Note that tree-affine is a tool that's used for this kind of "weak" equality.
Convert both to affine, subtract them and if that's zero they are equal.
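
For reference, a rough sketch of that comparison; the helper name is made up
and the signatures are assumed to match tree-affine.h (tree_to_aff_combination,
aff_combination_scale, aff_combination_add, aff_combination_zero_p):

---
/* Sketch only: "weak" equality of two expressions via tree-affine.
   Convert both to affine form, subtract, and test for zero.  */
static bool
exprs_affine_equal_p (tree e1, tree e2, tree type)
{
  aff_tree a1, a2;
  tree_to_aff_combination (e1, type, &a1);
  tree_to_aff_combination (e2, type, &a2);
  /* a1 := a1 - a2; the zero combination means the expressions agree.  */
  aff_combination_scale (&a2, -1);
  aff_combination_add (&a1, &a2);
  return aff_combination_zero_p (&a1);
}
---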

* [Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.
From: tnfchris at gcc dot gnu.org @ 2024-06-06  6:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #13 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #12) 
> > since we don't care about overflow here, it looks like the stripping should
> > be recursive as long as it's a NOP expression between two integral types.
> > 
> > That would get them to hash to the same IV expression.  Trying now..
> 
> Note tree-affine is a tool that's used for this kind of "weak" equalities.
> Convert both to affine, subtract them and if that's zero they are equal.

Hmm that's useful, though in this case this becomes the actual expression that
IVOpts uses.

For instance this is done in alloc_iv and add_iv_candidate_for_use when
determining the uses for the IV.

It looks like it's trying to force a canonical representation with as little
casting as possible.

Would the "affine"'d tree be safe to use in this context?

What I've done currently is make a STRIP_ALL_NOPS that recursively strips NOPs
for PLUS/MULT/MINUS.

* [Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #14 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 6 Jun 2024, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
> 
> --- Comment #13 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #12) 
> > > since we don't care about overflow here, it looks like the stripping should
> > > be recursive as long as it's a NOP expression between two integral types.
> > > 
> > > That would get them to hash to the same IV expression.  Trying now..
> > 
> > Note tree-affine is a tool that's used for this kind of "weak" equalities.
> > Convert both to affine, subtract them and if that's zero they are equal.
> 
> Hmm that's useful, though in this case this becomes the actual expression that
> IVOpts uses.
> 
> For instance this is done in alloc_iv and add_iv_candidate_for_use when
> determining the uses for the IV.
> 
> It looks like it's trying to force a canonical representation with as minimum
> casting as possible.
> 
> would the "affine"'ed tree be safe to use for this context?

No, I'd use that just for the comparison.

> What I've done currently is make a STRIP_ALL_NOPS that recursively strips NOPs
> for PLUS/MULT/MINUS.

But the stripped expression isn't necessarily valid to use either because
of possible undefined overflow.  It's probably safe to pick any of the
existing expressions (all are evaluated at each iteration), but if you
strip all NOPs from all of them you might end up with new undefined
behavior.

At least if that stripped expression is inserted somewhere or other
new expressions are built upon it.
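
For concreteness, a standalone illustration of the pitfall (hypothetical
values, not from the testcase): the unsigned form merely wraps, while the
stripped signed form would overflow, so re-materializing it as signed code
would introduce new UB:

---
// Standalone illustration: (unsigned long) x * 4 is defined (wraps mod 2^64)
// for every x, but the "stripped" signed multiply x * 4 overflows for large x.
#include <cstdint>
#include <cstdio>

int main ()
{
  int64_t x = INT64_MAX / 2 + 1;        // large enough that x * 4 overflows
  uint64_t wrapped = (uint64_t) x * 4;  // well defined: modular arithmetic
  int64_t product;
  bool overflows = __builtin_mul_overflow (x, (int64_t) 4, &product);
  std::printf ("unsigned result: %llu, signed multiply overflows: %s\n",
               (unsigned long long) wrapped, overflows ? "yes" : "no");
  return 0;
}
---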

* [Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.
From: tnfchris at gcc dot gnu.org @ 2024-06-06  8:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #15 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #14)
> On Thu, 6 Jun 2024, tnfchris at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
> > 
> > --- Comment #13 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > (In reply to rguenther@suse.de from comment #12) 
> > > > since we don't care about overflow here, it looks like the stripping should
> > > > be recursive as long as it's a NOP expression between two integral types.
> > > > 
> > > > That would get them to hash to the same IV expression.  Trying now..
> > > 
> > > Note tree-affine is a tool that's used for this kind of "weak" equalities.
> > > Convert both to affine, subtract them and if that's zero they are equal.
> > 
> > Hmm that's useful, though in this case this becomes the actual expression that
> > IVOpts uses.
> > 
> > For instance this is done in alloc_iv and add_iv_candidate_for_use when
> > determining the uses for the IV.
> > 
> > It looks like it's trying to force a canonical representation with as minimum
> > casting as possible.
> > 
> > would the "affine"'ed tree be safe to use for this context?
> 
> No, I'd use that just for the comparison.
> 
> > What I've done currently is make a STRIP_ALL_NOPS that recursively strips NOPs
> > for PLUS/MULT/MINUS.
> 
> But the stripped expression isn't necessarily valid to use either because
> of possible undefined overflow.  It's probably safe to pick any of the
> existing expressions (all are evaluated at each iteration), but if you
> strip all NOPs from all of them you might end up with new undefined
> behavior.
> 
> At least if that stripped expression is inserted somewhere or other
> new expressions are built upon it.

Does overflow matter for addressing modes though?  If you have undefined behavior
in your address space then your program would have crashed anyway, no?

In this case IVOpts would have already stripped away the outer NOPs, so building
upon this one could also cause undefined overflow, could it not?  I.e. if the IV
was ((signed) unsigned_calculation).
