public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/99068] New: Missed PowerPC lhau optimization
@ 2021-02-11  0:13 brian.grayson at sifive dot com
  2021-02-11  0:57 ` [Bug tree-optimization/99068] " pinskia at gcc dot gnu.org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: brian.grayson at sifive dot com @ 2021-02-11  0:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

            Bug ID: 99068
           Summary: Missed PowerPC lhau optimization
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: brian.grayson at sifive dot com
  Target Milestone: ---

(This relates to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99067 but is a
distinct target optimization bug).

This code:

int16_t a[1000];
int64_t N = 100;
int found_zero_ptr(int *a, int N) {
  for (int16_t* p = &a[0]; p <= &a[N]; p++) {
    if (*p == 0) return 1;
  }
  return 0;
}

generates this PowerPC assembly under -O3:
...
.L15:
  bgt 7,.L12
.L11:
  lha 10,0(9)
  addi 9,9,2
  cmpld 7,9,8
  cmpwi 0,10,0
  bne 0,.L15
...

In a minor variation of this code, the lha and addi are merged into an lhau.
Why does gcc not do that same merge in the code shown here?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/99068] Missed PowerPC lhau optimization
  2021-02-11  0:13 [Bug target/99068] New: Missed PowerPC lhau optimization brian.grayson at sifive dot com
@ 2021-02-11  0:57 ` pinskia at gcc dot gnu.org
  2021-02-12 16:23 ` segher at gcc dot gnu.org
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-02-11  0:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
          Component|target                      |tree-optimization

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/99068] Missed PowerPC lhau optimization
  2021-02-11  0:13 [Bug target/99068] New: Missed PowerPC lhau optimization brian.grayson at sifive dot com
  2021-02-11  0:57 ` [Bug tree-optimization/99068] " pinskia at gcc dot gnu.org
@ 2021-02-12 16:23 ` segher at gcc dot gnu.org
  2021-02-12 18:04 ` brian.grayson at sifive dot com
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: segher at gcc dot gnu.org @ 2021-02-12 16:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

Segher Boessenkool <segher at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |INVALID
                 CC|                            |segher at gcc dot gnu.org

--- Comment #1 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Because it would be incorrect?  lhau is pre-modify (like all update
form instructions).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/99068] Missed PowerPC lhau optimization
  2021-02-11  0:13 [Bug target/99068] New: Missed PowerPC lhau optimization brian.grayson at sifive dot com
  2021-02-11  0:57 ` [Bug tree-optimization/99068] " pinskia at gcc dot gnu.org
  2021-02-12 16:23 ` segher at gcc dot gnu.org
@ 2021-02-12 18:04 ` brian.grayson at sifive dot com
  2021-02-12 23:28 ` segher at gcc dot gnu.org
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: brian.grayson at sifive dot com @ 2021-02-12 18:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

Brian Grayson <brian.grayson at sifive dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|INVALID                     |---
             Status|RESOLVED                    |UNCONFIRMED

--- Comment #2 from Brian Grayson <brian.grayson at sifive dot com> ---
(In reply to Segher Boessenkool from comment #1)
> Because it would be incorrect?  lhau is pre-modify (like all update
> form instructions).

That's easily handled by pre-decrementing r9 by 2, as is standard in such
update-form optimizations, right?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/99068] Missed PowerPC lhau optimization
  2021-02-11  0:13 [Bug target/99068] New: Missed PowerPC lhau optimization brian.grayson at sifive dot com
                   ` (2 preceding siblings ...)
  2021-02-12 18:04 ` brian.grayson at sifive dot com
@ 2021-02-12 23:28 ` segher at gcc dot gnu.org
  2021-02-13  2:11 ` brian.grayson at sifive dot com
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: segher at gcc dot gnu.org @ 2021-02-12 23:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

Segher Boessenkool <segher at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |INVALID

--- Comment #3 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Then you get

addi 9,9,-2
lhau 10,2(9)
addi 9,9,2

which is worse than just

lha 10,0(9)
addi 9,9,2

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/99068] Missed PowerPC lhau optimization
  2021-02-11  0:13 [Bug target/99068] New: Missed PowerPC lhau optimization brian.grayson at sifive dot com
                   ` (3 preceding siblings ...)
  2021-02-12 23:28 ` segher at gcc dot gnu.org
@ 2021-02-13  2:11 ` brian.grayson at sifive dot com
  2021-02-16 16:11 ` segher at gcc dot gnu.org
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: brian.grayson at sifive dot com @ 2021-02-13  2:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

Brian Grayson <brian.grayson at sifive dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|INVALID                     |---
             Status|RESOLVED                    |UNCONFIRMED

--- Comment #4 from Brian Grayson <brian.grayson at sifive dot com> ---
(In reply to Segher Boessenkool from comment #3)
> Then you get
> 
> addi 9,9,-2
> lhau 10,2(9)
> addi 9,9,2
> 
> which is worse than just
> 
> lha 10,0(9)
> addi 9,9,2

Why is the second addi needed, in your example? And note that if a
pre-decrement "addi 9,9,-2" is needed to pre-bias the pointer, it is done once
outside the loop, and not in every iteration of the loop.

When I alter the code some (to loop via index, instead of via pointer), the
lhau gets emitted, and I get this:

.LC0:
  .quad a-2
...
  addis 8,2,.LC0@toc@ha
  ld 8,.LC0@toc@l(8)
...
.L8:
  blt 7,.L4
.L3:
  lhau 9,2(8)  <-- r8 is loaded from a TOC entry that is pre-decremented by 2
  addi 10,10,1
  cmpd 7,7,10
  cmpwi 0,9,0
  bne 0,.L8
  li 3,1
  blr

The TOC entry is already pre-decremented, so that's not a concern. And there is
no need for an addi because the lhau does the job.

I'm not sure I see your reason for closing the bug. GCC is capable of emitting
the code above, and it's correct, without any extra instructions. I'm only
asking gcc to emit the same code when it is coded via pointers, instead of only
when coded via indexing.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/99068] Missed PowerPC lhau optimization
  2021-02-11  0:13 [Bug target/99068] New: Missed PowerPC lhau optimization brian.grayson at sifive dot com
                   ` (4 preceding siblings ...)
  2021-02-13  2:11 ` brian.grayson at sifive dot com
@ 2021-02-16 16:11 ` segher at gcc dot gnu.org
  2021-02-16 16:15 ` segher at gcc dot gnu.org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: segher at gcc dot gnu.org @ 2021-02-16 16:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

Segher Boessenkool <segher at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |INVALID

--- Comment #5 from Segher Boessenkool <segher at gcc dot gnu.org> ---
a) The code does not compile (it is not complete source code, some includes
are needed);
b) When you fix that, the compiler will tell you it is invalid code (you
shadow "a" with an incompatible type).
c) You do not say which target you used.

So let's try this (on powerpc64-linux, -O3 since you want that, everything
else default):

===
#include <stdint.h>

int found_zero_ptr(int16_t *a, int N)
{
        for (int16_t *p = a; p < a + N; p++)
                if (*p == 0)
                        return 1;
        return 0;
}
===

which as core has

===
.L21:
        lha 9,2(3)
        addi 3,3,4
        cmpld 7,3,4
        cmpwi 0,9,0
        beq 0,.L5
        bge 7,.L4
.L3:
        lha 9,0(3)
        cmpwi 0,9,0
        bne 0,.L21
===

You cannot use lha in this without making the generated code worse.

(Marking this invalid *again*.)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/99068] Missed PowerPC lhau optimization
  2021-02-11  0:13 [Bug target/99068] New: Missed PowerPC lhau optimization brian.grayson at sifive dot com
                   ` (5 preceding siblings ...)
  2021-02-16 16:11 ` segher at gcc dot gnu.org
@ 2021-02-16 16:15 ` segher at gcc dot gnu.org
  2021-02-16 18:34 ` brian.grayson at sifive dot com
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: segher at gcc dot gnu.org @ 2021-02-16 16:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

--- Comment #6 from Segher Boessenkool <segher at gcc dot gnu.org> ---
(In reply to Brian Grayson from comment #4)
> (In reply to Segher Boessenkool from comment #3)
> > Then you get
> > 
> > addi 9,9,-2
> > lhau 10,2(9)
> > addi 9,9,2
> > 
> > which is worse than just
> > 
> > lha 10,0(9)
> > addi 9,9,2
> 
> Why is the second addi needed, in your example?

Because I typoed it.

> And note that if a
> pre-decrement "addi 9,9,-2" is needed to pre-bias the pointer, it is done
> once outside the loop, and not in every iteration of the loop.

You cannot do that without changing the loop structure.  There are various
non-trivial other paths into and out of the loop body.

Since ivopts has decided to not use pre-increment here (because it is more
expensive than not using it), we do not use it.

Why do you think having a lhau is better?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/99068] Missed PowerPC lhau optimization
  2021-02-11  0:13 [Bug target/99068] New: Missed PowerPC lhau optimization brian.grayson at sifive dot com
                   ` (6 preceding siblings ...)
  2021-02-16 16:15 ` segher at gcc dot gnu.org
@ 2021-02-16 18:34 ` brian.grayson at sifive dot com
  2021-02-16 22:54 ` segher at gcc dot gnu.org
  2021-02-17  0:33 ` brian.grayson at sifive dot com
  9 siblings, 0 replies; 11+ messages in thread
From: brian.grayson at sifive dot com @ 2021-02-16 18:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

--- Comment #7 from Brian Grayson <brian.grayson at sifive dot com> ---
A single lhau instruction is better than two instructions (lha + addi) for many
reasons. Are there reasons that you feel a two-instruction sequence of lha+addi
is *superior* to just an lhau?

On all PowerPC implementations, reducing the loop size by one improves the odds
that the entire loop fits within a cache line, reducing the fetch bubbles that
might otherwise occur.

On all PowerPC implementations, saving one instruction reduces overall
instruction-cache pressure, especially if this construct or other
lhau-compatible situations occur multiple times within a function for looping
over different arrays etc.

On all PowerPC implementations, decoding a single instruction should be less
power than decoding two, even in the presence of cracking.

On some PowerPC implementations, the lhau will crack into two micro-ops (either
at decode, or possibly later in the pipe -- I've worked on implementations that
have done both approaches), and still use two execution units (LSU + ALU), but
on some others, it will go to only one unit and merely provide two results from
the LSU. Either way, lhau is no worse than lha+addi, and sometimes better.

There is a potential issue on some implementations, if they cannot handle
cracking and decoding further instructions in a single cycle, but that should
be controlled by an -march/-mcpu/-mtune flag, not as a blanket across all past
and future implementations, as some existing implementations do *not* pay a
penalty when decoding an update-form instruction.

If the concern is over lha instead of lhz, as some implementations may crack
lha into lhz+extsh, that is a distinct reason, but the same "missed
optimization" occurs if I change the types to uint16_t, where the lha crack
concern disappears: gcc still emits lhz/addi instead of lhzu.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/99068] Missed PowerPC lhau optimization
  2021-02-11  0:13 [Bug target/99068] New: Missed PowerPC lhau optimization brian.grayson at sifive dot com
                   ` (7 preceding siblings ...)
  2021-02-16 18:34 ` brian.grayson at sifive dot com
@ 2021-02-16 22:54 ` segher at gcc dot gnu.org
  2021-02-17  0:33 ` brian.grayson at sifive dot com
  9 siblings, 0 replies; 11+ messages in thread
From: segher at gcc dot gnu.org @ 2021-02-16 22:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

--- Comment #8 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Using update form instructions constrains register allocation and scheduling.
It is *not* always a good idea.

That is one of the reasons why we currently use update form instructions only
when insns just happen to land close (in the same basic block).  See
auto_inc_dec.c .

We also do some work to make it more likely that loops will use these
constructs.  We don't go out of our way to use update form insns at the cost
of everything else.  You will see them more often if you use -Os or -O1.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug tree-optimization/99068] Missed PowerPC lhau optimization
  2021-02-11  0:13 [Bug target/99068] New: Missed PowerPC lhau optimization brian.grayson at sifive dot com
                   ` (8 preceding siblings ...)
  2021-02-16 22:54 ` segher at gcc dot gnu.org
@ 2021-02-17  0:33 ` brian.grayson at sifive dot com
  9 siblings, 0 replies; 11+ messages in thread
From: brian.grayson at sifive dot com @ 2021-02-17  0:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

--- Comment #9 from Brian Grayson <brian.grayson at sifive dot com> ---
If I understand correctly, you're saying that it is sometimes preferred for gcc
to avoid update form, but even when the load and addi are next to each other
it's possible to use update form, like in the example assembly discussed in
this bug? It seems the situation here is exactly the poster child for when it
*should* generate an lhau -- same register being used, contiguous instructions,
no change in register pressure by using lhau. In fact, it seems a peephole
optimizer would do exactly that type of change.

If it's preferred to not use update form even when one has contiguous
instructions and no register pressure change, then I guess it's a bug that gcc
*does* generate an lhau under -O3 for found_zero(). Or to state it differently,
I don't understand why gcc generates code differently for these two cases
w.r.t. lhau, as they are both the same underlying computation and gcc knows
that:

#include <stdint.h>
int16_t a[1000];
int64_t N = 100;
int found_zero() { // Generates an lhau.
  for (int64_t i = 0; i <= N; i++) { if (a[i] == 0) return 1; }
  return 0;
}
int found_zero_ptr() { // Does not generate an lhau.
  for (int16_t* p = &a[0]; p <= &a[N]; p++) { if (*p == 0) return 1; }
  return 0;
}

This emits:

  .file "stub.c"
  .machine power4
  .section  ".text"
  .section  ".toc","aw"
  .align 3
.LC0:
  .quad a-2
  .section  ".text"
  .align 2
  .p2align 4,,15
  .globl found_zero
  .section  ".opd","aw"
  .align 3
found_zero:
  .quad .L.found_zero,.TOC.@tocbase,0
  .previous
  .type found_zero, @function
.L.found_zero:
.LFB0:
  .cfi_startproc
  addis 7,2,.LANCHOR0@toc@ha
  ld 7,.LANCHOR0@toc@l(7)
  cmpdi 0,7,0
  blt 0,.L4
  addis 8,2,.LC0@toc@ha
  ld 8,.LC0@toc@l(8)
  li 10,0
  b .L3
  .p2align 4,,15
.L8:
  bgt 7,.L4
.L3:
  lhau 9,2(8)   <--- lhau generated here, instead of an addi r8,r8,2
  addi 10,10,1
  cmpd 7,10,7
  cmpwi 0,9,0
  bne 0,.L8
  li 3,1
  blr
  .p2align 4,,15
.L4:
  li 3,0
  blr
  .long 0
  .byte 0,0,0,0,0,0,0,0
  .cfi_endproc
.LFE0:
  .size found_zero,.-.L.found_zero
  .section  ".toc","aw"
.LC1:
  .quad a
  .section  ".text"
  .align 2
  .p2align 4,,15
  .globl found_zero_ptr
  .section  ".opd","aw"
  .align 3
found_zero_ptr:
  .quad .L.found_zero_ptr,.TOC.@tocbase,0
  .previous
  .type found_zero_ptr, @function
.L.found_zero_ptr:
.LFB1:
  .cfi_startproc
  addis 8,2,.LANCHOR0@toc@ha
  ld 8,.LANCHOR0@toc@l(8)
  addis 9,2,.LC1@toc@ha
  ld 9,.LC1@toc@l(9)
  sldi 8,8,1
  add 8,8,9
  cmpld 0,8,9
  bge 0,.L11
  b .L12
  .p2align 4,,15
.L15:
  bgt 7,.L12
.L11:
  lha 10,0(9)   <--- No lhau here, even though r9 is overwritten as 
  addi 9,9,2    <--- source and dest by the addi, and they are contiguous
  cmpld 7,9,8
  cmpwi 0,10,0
  bne 0,.L15
  li 3,1
  blr
  .p2align 4,,15
.L12:
  li 3,0
  blr
  .long 0
  .byte 0,0,0,0,0,0,0,0
  .cfi_endproc
.LFE1:
  .size found_zero_ptr,.-.L.found_zero_ptr

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-02-17  0:33 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-11  0:13 [Bug target/99068] New: Missed PowerPC lhau optimization brian.grayson at sifive dot com
2021-02-11  0:57 ` [Bug tree-optimization/99068] " pinskia at gcc dot gnu.org
2021-02-12 16:23 ` segher at gcc dot gnu.org
2021-02-12 18:04 ` brian.grayson at sifive dot com
2021-02-12 23:28 ` segher at gcc dot gnu.org
2021-02-13  2:11 ` brian.grayson at sifive dot com
2021-02-16 16:11 ` segher at gcc dot gnu.org
2021-02-16 16:15 ` segher at gcc dot gnu.org
2021-02-16 18:34 ` brian.grayson at sifive dot com
2021-02-16 22:54 ` segher at gcc dot gnu.org
2021-02-17  0:33 ` brian.grayson at sifive dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).