public inbox for gcc-bugs@sourceware.org
* [Bug target/95704] New: PPC: int128 shifts should be implemented branchless
@ 2020-06-16 17:14 jens.seifert at de dot ibm.com
  2020-06-16 17:15 ` [Bug target/95704] " jens.seifert at de dot ibm.com
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: jens.seifert at de dot ibm.com @ 2020-06-16 17:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

            Bug ID: 95704
           Summary: PPC: int128 shifts should be implemented branchless
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Created attachment 48741
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48741&action=edit
input with branchless 128-bit shifts

PowerPC processors don't like branches, and branch mispredictions carry a large
penalty.

Shift left/right of unsigned __int128 can be implemented in 8 instructions,
which can be processed on 2 pipelines almost in parallel, leading to ~5 cycles
of latency on Power 7 and 8.
Shift right algebraic of __int128 can be implemented in 10 instructions.
Overall, the latency is comparable to that of the branching code.

The attached file contains the branchless implementations in C. I know that
this code relies on undefined behavior, but the resulting assembly is the
interesting part.

The rldicl 8,5,0,32 at the beginning of the routines is also unnecessary.
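
For illustration only, here is a minimal sketch of how the right shifts
described above could be written without branches. It is not the attached
code. The helpers sld, srd and srad are hypothetical C models of the PowerPC
instructions of the same name, which accept shift amounts of 64..127 (the
result is 0, or all sign bits for srad); the ternaries inside them exist only
to keep the model well defined in C. parts128, shr_logical, shr_algebraic and
check_right_shifts are names assumed for this sketch, and the check relies on
the GCC/Clang __int128 extension with arithmetic right shift of signed values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value split into two 64-bit halves. */
typedef struct { uint64_t lo, hi; } parts128;

/* C models of the PowerPC shifts: only the low 7 bits of n are used. */
static uint64_t sld(uint64_t x, uint64_t n) { n &= 127; return n < 64 ? x << n : 0; }
static uint64_t srd(uint64_t x, uint64_t n) { n &= 127; return n < 64 ? x >> n : 0; }
static uint64_t srad(uint64_t x, uint64_t n)
{
    n &= 127;
    if (n > 63) n = 63;              /* amounts 64..127 give 64 sign bits */
    return (x >> 63) ? ~(~x >> n) : x >> n;
}

/* Logical shift right, n in 0..127: every term is computed unconditionally. */
static parts128 shr_logical(parts128 v, uint64_t n)
{
    parts128 r;
    r.lo = srd(v.lo, n) | sld(v.hi, 64 - n) | srd(v.hi, n - 64);
    r.hi = srd(v.hi, n);
    return r;
}

/* Algebraic shift right, n in 0..127. srad(hi, n - 64) yields sign bits
   rather than 0 when n < 64, so that term is masked off in that range. */
static parts128 shr_algebraic(parts128 v, uint64_t n)
{
    parts128 r;
    uint64_t below64 = srad(n - 64, 63);      /* all ones when n < 64 */
    r.lo = srd(v.lo, n) | sld(v.hi, 64 - n) | (srad(v.hi, n - 64) & ~below64);
    r.hi = srad(v.hi, n);
    return r;
}

/* Compare against the compiler's own __int128 shifts for n = 0..127. */
static void check_right_shifts(uint64_t hi, uint64_t lo)
{
    unsigned __int128 u = ((unsigned __int128)hi << 64) | lo;
    for (unsigned n = 0; n < 128; n++) {
        parts128 v = { lo, hi };
        parts128 l = shr_logical(v, n);
        parts128 a = shr_algebraic(v, n);
        unsigned __int128 ul = u >> n;
        unsigned __int128 ua = (unsigned __int128)((__int128)u >> n);
        assert(l.lo == (uint64_t)ul && l.hi == (uint64_t)(ul >> 64));
        assert(a.lo == (uint64_t)ua && a.hi == (uint64_t)(ua >> 64));
    }
}
```

In this model the algebraic variant needs two operations more than the
logical one (the mask and the and-not), which is in line with the 10 versus
8 instruction counts given above, though the attachment may do it differently.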

* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: jens.seifert at de dot ibm.com @ 2020-06-16 17:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

--- Comment #1 from Jens Seifert <jens.seifert at de dot ibm.com> ---
Created attachment 48742
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48742&action=edit
assembly

* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: segher at gcc dot gnu.org @ 2020-06-17 12:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

Segher Boessenkool <segher at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
                 CC|                            |segher at gcc dot gnu.org
   Last reconfirmed|                            |2020-06-17

--- Comment #2 from Segher Boessenkool <segher at gcc dot gnu.org> ---
(In reply to Jens Seifert from comment #0)
> PowerPC processors don't like branches and branch mispredicts lead to large
> overhead.

While that is of course true, the situation isn't worse than on
other CPUs.

The situation here is exactly analogous to 64-bit shifts with -m32.

Fixed distance shifts (and rotates) generate pretty much ideal code
already (sometimes it could save a "mr" insn, by reordering more --
that is because the rl*imi insns use a register as both input and
output).

> shift left/right unsigned __int128 can be implemented in 8 instructions which
> can be processed on 2 pipelines almost in parallel leading to ~5 cycle
> latency on Power 7 and 8.

> shift right algebraic __int128 can be implemented in 10 instructions.
> Overall comparable in latency of the branching code.

This can be done better, using the fact that shifts over 64..127
bits are defined just fine for 64-bit power shift insns.

> The unnecessary rldicl 8,5,0,32 at the beginning of the routines are also
> not necessary.

I see no rldicl?

Confirmed.

* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: jens.seifert at de dot ibm.com @ 2020-06-17 13:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

--- Comment #3 from Jens Seifert <jens.seifert at de dot ibm.com> ---
GCC 8.3 generates:
_Z3shloy:
.LFB0:
        .cfi_startproc
        addi 9,5,-64
        cmpwi 7,9,0
        blt 7,.L2
        sld 4,3,9
        li 3,0
        blr
        .p2align 4,,15
.L2:
        srdi 9,3,1
        subfic 10,5,63
        sld 4,4,5
        srd 9,9,10
        sld 3,3,5
        or 4,9,4
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
        .cfi_endproc

8 instructions if .L2 is taken. The branch-free code I proposed:

_Z15shl_branch_lessoy:
.LFB1:
        .cfi_startproc
        rldicl 5,5,0,32
        subfic 9,5,64
        addi 10,5,-64
        sld 10,3,10
        srd 9,3,9
        sld 4,4,5
        or 9,9,10
        or 4,9,4
        sld 3,3,5
        blr

8 instructions, no branch. Almost everything can be executed in parallel.

GCC adds the rldicl 5,5,0,32, which is not necessary.
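
As a reading aid, the branch-free sequence above can be modeled in C roughly
as follows. This is a sketch under assumptions, not the attached source: sld
and srd are hypothetical models of the PowerPC instructions (only the low 7
bits of the amount are used, and amounts 64..127 give 0), parts128,
shl_branchless and check_shl are names made up here, and r3/r4 are taken to
hold the low/high halves as in the listings. The check uses the GCC/Clang
__int128 extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value split into two 64-bit halves. */
typedef struct { uint64_t lo, hi; } parts128;

/* C models of sld/srd: the ternary only keeps the model defined in C. */
static uint64_t sld(uint64_t x, uint64_t n) { n &= 127; return n < 64 ? x << n : 0; }
static uint64_t srd(uint64_t x, uint64_t n) { n &= 127; return n < 64 ? x >> n : 0; }

/* Shift left by n in 0..127 with no data-dependent control flow. */
static parts128 shl_branchless(parts128 v, uint64_t n)
{
    parts128 r;
    uint64_t lo_to_hi_big   = sld(v.lo, n - 64);  /* addi 10,5,-64 ; sld 10,3,10 */
    uint64_t lo_to_hi_small = srd(v.lo, 64 - n);  /* subfic 9,5,64 ; srd 9,3,9   */
    r.hi = sld(v.hi, n) | lo_to_hi_small | lo_to_hi_big;  /* sld 4,4,5 ; or ; or */
    r.lo = sld(v.lo, n);                                  /* sld 3,3,5           */
    return r;
}

/* Compare against the compiler's own unsigned __int128 shift for n = 0..127. */
static void check_shl(uint64_t hi, uint64_t lo)
{
    unsigned __int128 u = ((unsigned __int128)hi << 64) | lo;
    for (unsigned n = 0; n < 128; n++) {
        parts128 v = { lo, hi };
        parts128 r = shl_branchless(v, n);
        unsigned __int128 e = u << n;
        assert(r.lo == (uint64_t)e && r.hi == (uint64_t)(e >> 64));
    }
}
```

For any n in 0..127 at most one of the two lo-derived terms is nonzero
(both equal lo at n = 64), so the two "or" instructions can combine them
unconditionally.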

* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: segher at gcc dot gnu.org @ 2020-06-17 14:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

--- Comment #4 from Segher Boessenkool <segher at gcc dot gnu.org> ---
It no longer generates that rldicl in GCC 9 (or GCC 10).

You do get straight-line code already if you use -mcpu=power9, btw
(isel; and not totally awful code, but it isn't super of course).

* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: jens.seifert at de dot ibm.com @ 2020-06-17 15:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

--- Comment #5 from Jens Seifert <jens.seifert at de dot ibm.com> ---
The Power9 code is branch-free but not good at all.

_Z3shloy:
.LFB0:
        .cfi_startproc
        addi 8,5,-64
        subfic 6,5,63
        srdi 10,3,1
        li 7,0
        sld 4,4,5
        sld 5,3,5
        cmpwi 7,8,0
        srd 10,10,6
        sld 3,3,8
        or 4,10,4
        isel 5,5,7,28
        isel 4,4,3,28
        mr 3,5
        blr

13 instructions.
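
For comparison, the compute-both-then-select shape of this Power9 output can
be written in fully defined C, roughly as below. This is a sketch, not
compiler output: parts128, shl_select and check_shl_select are hypothetical
names, n is assumed to be in 0..127, and whether the ternaries become isel
depends on the compiler and flags. Unlike the listing, it masks the amount to
0..63 instead of relying on sld accepting larger amounts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value split into two 64-bit halves. */
typedef struct { uint64_t lo, hi; } parts128;

/* Shift left by n in 0..127: compute both candidates, then select. */
static parts128 shl_select(parts128 v, unsigned n)
{
    unsigned s = n & 63;              /* always a valid 64-bit shift amount */
    int big = (n & 64) != 0;          /* true for n in 64..127              */
    uint64_t lo_shifted = v.lo << s;
    /* Same trick as srdi/subfic/srd above: (lo >> 1) >> (63 - s) equals
       lo >> (64 - s) for s in 1..63 and is 0 for s == 0, with no
       out-of-range shift. */
    uint64_t carry = (v.lo >> 1) >> (63 - s);
    uint64_t hi_small = (v.hi << s) | carry;
    parts128 r;
    r.hi = big ? lo_shifted : hi_small;   /* second isel in the listing */
    r.lo = big ? 0 : lo_shifted;          /* first isel in the listing  */
    return r;
}

/* Compare against the compiler's own unsigned __int128 shift for n = 0..127. */
static void check_shl_select(uint64_t hi, uint64_t lo)
{
    unsigned __int128 u = ((unsigned __int128)hi << 64) | lo;
    for (unsigned n = 0; n < 128; n++) {
        parts128 v = { lo, hi };
        parts128 r = shl_select(v, n);
        unsigned __int128 e = u << n;
        assert(r.lo == (uint64_t)e && r.hi == (uint64_t)(e >> 64));
    }
}
```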

* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: segher at gcc dot gnu.org @ 2020-06-17 16:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

--- Comment #6 from Segher Boessenkool <segher at gcc dot gnu.org> ---
13 insns, but the longest chain is 4.  As I said, not totally awful, and
much better than the branchy code (which does not predict well, for more
general inputs anyway).

* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: pinskia at gcc dot gnu.org @ 2021-05-30 22:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
