public inbox for gcc-bugs@sourceware.org
* [Bug target/95704] New: PPC: int128 shifts should be implemented branchless
@ 2020-06-16 17:14 jens.seifert at de dot ibm.com
2020-06-16 17:15 ` [Bug target/95704] " jens.seifert at de dot ibm.com
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: jens.seifert at de dot ibm.com @ 2020-06-16 17:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704
Bug ID: 95704
Summary: PPC: int128 shifts should be implemented branchless
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---
Created attachment 48741
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48741&action=edit
input with branchless 128-bit shifts
PowerPC processors don't like branches, and branch mispredicts lead to large
overhead.
Shift left/right of unsigned __int128 can be implemented in 8 instructions, which
can be processed on 2 pipelines almost in parallel, leading to ~5 cycle latency
on Power 7 and 8.
Shift right algebraic of __int128 can be implemented in 10 instructions.
Overall, the latency is comparable to that of the branching code.
The attached file contains the branchless implementations in C. I know that
this relies on undefined behavior, but the resulting assembly is the interesting
part.
The rldicl 8,5,0,32 at the beginning of the routines is also not
necessary.
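For readers without the attachment, here is a minimal C sketch of the branchless unsigned right shift described above. It is not the attached code. To stay clear of undefined behavior it emulates, in the hypothetical helpers `sld64` and `srd64`, the Power rule that `sld`/`srd` use only the low 7 bits of the shift amount and return zero for amounts 64..127. The `u128` type and all function names are assumptions made for illustration.

```c
#include <stdint.h>

/* Two 64-bit halves of a 128-bit value (hypothetical helper type). */
typedef struct { uint64_t lo, hi; } u128;

/* Emulates Power sld: low 7 bits of n are used; amounts 64..127 give 0. */
static uint64_t sld64(uint64_t x, unsigned n)
{
    n &= 127;
    return (n & 64) ? 0 : x << n;
}

/* Emulates Power srd under the same shift-amount rule. */
static uint64_t srd64(uint64_t x, unsigned n)
{
    n &= 127;
    return (n & 64) ? 0 : x >> n;
}

/* Logical right shift of a 128-bit value by n, for 0 <= n <= 127. */
static u128 shr128(u128 v, unsigned n)
{
    u128 r;
    r.lo = srd64(v.lo, n)          /* low-half bits that stay in the low half */
         | sld64(v.hi, 64u - n)    /* contributes for 1 <= n <= 64            */
         | srd64(v.hi, n - 64u);   /* contributes for 64 <= n <= 127          */
    r.hi = srd64(v.hi, n);
    return r;
}
```

The ternaries in the helpers stand in for what a single `sld` or `srd` does in hardware; whether a given compiler folds them into one instruction is not guaranteed.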
* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: jens.seifert at de dot ibm.com @ 2020-06-16 17:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704
--- Comment #1 from Jens Seifert <jens.seifert at de dot ibm.com> ---
Created attachment 48742
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48742&action=edit
assembly
* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: segher at gcc dot gnu.org @ 2020-06-17 12:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704
Segher Boessenkool <segher at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
CC| |segher at gcc dot gnu.org
Last reconfirmed| |2020-06-17
--- Comment #2 from Segher Boessenkool <segher at gcc dot gnu.org> ---
(In reply to Jens Seifert from comment #0)
> PowerPC processors don't like branches and branch mispredicts lead to large
> overhead.
While that is of course true, the situation isn't worse than on
other CPUs.
The situation here is exactly analogous to 64-bit shifts with -m32.
Fixed distance shifts (and rotates) generate pretty much ideal code
already (sometimes it could save a "mr" insn, by reordering more --
that is because the rl*imi insns use a register as both input and
output).
> shift left/right unsigned __int128 can be implemented in 8 instructions which
> can be processed on 2 pipelines almost in parallel leading to ~5 cycle
> latency on Power 7 and 8.
> shift right algebraic __int128 can be implemented in 10 instructions.
> Overall comparable in latency of the branching code.
This can be done better, using the fact that shifts over 64..127
bits are defined just fine for 64-bit power shift insns.
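To make the property mentioned here concrete, the following is a small C model of how the 64-bit Power shift instructions treat their shift amount: only the low 7 bits are used, `sld` yields zero for amounts 64..127, and `srad` yields 64 copies of the sign bit. This is an illustrative sketch, not code from this report; the names `ppc_sld` and `ppc_srad` are hypothetical, and values are passed as raw 64-bit patterns so that the C itself stays well defined.

```c
#include <stdint.h>

/* Model of sld: low 7 bits of n are used; amounts 64..127 give zero. */
static uint64_t ppc_sld(uint64_t x, unsigned n)
{
    n &= 127;
    return (n & 64) ? 0 : x << n;
}

/* Model of srad on a raw bit pattern: amounts 64..127 give 64 sign bits. */
static uint64_t ppc_srad(uint64_t x, unsigned n)
{
    uint64_t sign = 0 - (x >> 63);   /* all ones if the top bit is set, else 0 */
    n &= 127;
    if (n & 64)
        return sign;
    /* Invert, shift logically, invert back: fills with the sign bit without
       relying on implementation-defined signed right shifts. */
    return ((x ^ sign) >> n) ^ sign;
}
```

Plain C gives no such guarantee: shifting a 64-bit operand by 64 or more is undefined, which is why the model masks the amount explicitly.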
> The unnecessary rldicl 8,5,0,32 at the beginning of the routines are also
> not necessary.
I see no rldicl?
Confirmed.
* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: jens.seifert at de dot ibm.com @ 2020-06-17 13:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704
--- Comment #3 from Jens Seifert <jens.seifert at de dot ibm.com> ---
GCC 8.3 generates:
_Z3shloy:
.LFB0:
        .cfi_startproc
        addi 9,5,-64
        cmpwi 7,9,0
        blt 7,.L2
        sld 4,3,9
        li 3,0
        blr
        .p2align 4,,15
.L2:
        srdi 9,3,1
        subfic 10,5,63
        sld 4,4,5
        srd 9,9,10
        sld 3,3,5
        or 4,9,4
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
        .cfi_endproc
8 instructions if taking .L2. The branch-free code I proposed:
_Z15shl_branch_lessoy:
.LFB1:
        .cfi_startproc
        rldicl 5,5,0,32
        subfic 9,5,64
        addi 10,5,-64
        sld 10,3,10
        srd 9,3,9
        sld 4,4,5
        or 9,9,10
        or 4,9,4
        sld 3,3,5
        blr
8 instructions, no branch. Almost everything can be executed in parallel.
GCC adds the rldicl 5,5,0,32, which is not necessary.
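As a reading aid, the branch-free listing above can be followed line by line in C. The sketch below is an interpretation of that listing, not the attached source: it assumes r3 holds the low half, r4 the high half and r5 the shift amount (which is how the GCC 8.3 output above uses them), and the names `regpair`, `insn_sld` and `insn_srd` are hypothetical. The helpers model the Power shift-amount rule (low 7 bits used, 64..127 gives zero) so that the C is well defined.

```c
#include <stdint.h>

/* lo corresponds to r3, hi to r4 in the listing (assumed mapping). */
typedef struct { uint64_t lo, hi; } regpair;

/* Model of sld: low 7 bits of n are used; amounts 64..127 give 0. */
static uint64_t insn_sld(uint64_t x, unsigned n)
{
    n &= 127;
    return (n & 64) ? 0 : x << n;
}

/* Model of srd under the same rule. */
static uint64_t insn_srd(uint64_t x, unsigned n)
{
    n &= 127;
    return (n & 64) ? 0 : x >> n;
}

/* Left shift of a 128-bit value by n, for 0 <= n <= 127. */
static regpair shl128(regpair v, unsigned n)
{
    unsigned a  = 64u - n;                /* subfic 9,5,64          */
    unsigned b  = n - 64u;                /* addi 10,5,-64          */
    uint64_t t1 = insn_sld(v.lo, b);      /* sld 10,3,10            */
    uint64_t t2 = insn_srd(v.lo, a);      /* srd 9,3,9              */
    uint64_t h  = insn_sld(v.hi, n);      /* sld 4,4,5              */
    regpair r;
    r.hi = (t2 | t1) | h;                 /* or 9,9,10 ; or 4,9,4   */
    r.lo = insn_sld(v.lo, n);             /* sld 3,3,5              */
    return r;
}
```

The three shifts feeding `r.hi` are independent of each other once `a` and `b` are known, which is the parallelism the comment refers to.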
* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: segher at gcc dot gnu.org @ 2020-06-17 14:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704
--- Comment #4 from Segher Boessenkool <segher at gcc dot gnu.org> ---
It no longer generates that rldicl in GCC 9 (or GCC 10).
You do get straight-line code already if you use -mcpu=power9, btw
(isel; and not totally awful code, but it isn't super of course).
* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: jens.seifert at de dot ibm.com @ 2020-06-17 15:28 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704
--- Comment #5 from Jens Seifert <jens.seifert at de dot ibm.com> ---
The Power9 code is branch-free but not good at all.
_Z3shloy:
.LFB0:
        .cfi_startproc
        addi 8,5,-64
        subfic 6,5,63
        srdi 10,3,1
        li 7,0
        sld 4,4,5
        sld 5,3,5
        cmpwi 7,8,0
        srd 10,10,6
        sld 3,3,8
        or 4,10,4
        isel 5,5,7,28
        isel 4,4,3,28
        mr 3,5
        blr
13 instructions.
* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: segher at gcc dot gnu.org @ 2020-06-17 16:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704
--- Comment #6 from Segher Boessenkool <segher at gcc dot gnu.org> ---
13 insns, but the longest chain is 4. As I said, not totally awful, and
much better than the branchy code (which does not predict well, for more
general inputs anyway).
* [Bug target/95704] PPC: int128 shifts should be implemented branchless
From: pinskia at gcc dot gnu.org @ 2021-05-30 22:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement