public inbox for gcc-bugs@sourceware.org
* [Bug target/104773] New: compare with 1 not merged with subtract 1
@ 2022-03-03 16:02 peter at cordes dot ca
From: peter at cordes dot ca @ 2022-03-03 16:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773
Bug ID: 104773
Summary: compare with 1 not merged with subtract 1
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*, arm-*-*
std::bit_ceil(x) involves `if(x == 0 || x == 1) return 1;` and otherwise
`1u << (32-clz(x-1))`.
The compare of course compiles to an unsigned <= 1, which can be done with a
sub instead of a cmp, producing the value we need as the input for the
leading-zero count. But GCC does *not* do this. (Neither does clang for
x86-64.) I trimmed the libstdc++ <bit> code down to something I could
compile even when Godbolt doesn't have working headers for some ISAs:
https://godbolt.org/z/3EE7W5bna
#include <limits>

// cut down from libstdc++ for normal integer cases; compiles the same
template<typename _Tp>
  constexpr _Tp
  bit_ceil(_Tp __x) noexcept
  {
    constexpr auto _Nd = std::numeric_limits<_Tp>::digits;
    if (__x == 0 || __x == 1)
      return 1;
    auto __shift_exponent = _Nd - __builtin_clz((_Tp)(__x - 1u));
    // using __promoted_type = decltype(__x << 1); ... // removed check for x<<n widening the result
    return (_Tp)1u << __shift_exponent;
  }

// assumed wrapper matching the roundup(unsigned int) symbol in the asm below
unsigned roundup(unsigned x) { return bit_ceil(x); }
For x86-64 with GCC trunk -O3 -march=ivybridge, we get this inefficient code:
roundup(unsigned int):
        mov     eax, 1
        cmp     edi, 1
        jbe     .L1
        sub     edi, 1          # could have just done a sub in the first place
        bsr     edi, edi        # correctly avoiding a false dependency by *not* using ECX as the destination
        lea     ecx, [rdi+1]    # could have shifted 2<<n instead of 1<<(n+1)
        sal     eax, cl         # 3 uops; bts (1 uop) is a more efficient way to materialize 1<<n
.L1:
        ret
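The asm comments above note that BSR already yields n = 31 - clz(x-1), so 2<<n gives the same value as 1<<(n+1) without the extra LEA. A minimal C++ sketch of that arithmetic, assuming a 32-bit unsigned int and the GCC/clang __builtin_clz builtin; roundup_2shl and ceil_pow2_slow are hypothetical names, and this only checks the identity, not what any compiler emits:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit_ceil variant: shift 2 by the BSR index instead of 1 by index+1.
// Precondition (as for std::bit_ceil): x <= 2^31, so the result fits in 32 bits.
std::uint32_t roundup_2shl(std::uint32_t x) {
    if (x <= 1u)
        return 1u;
    // x - 1 is nonzero here, so __builtin_clz is well-defined.
    unsigned n = 31u - static_cast<unsigned>(__builtin_clz(x - 1u)); // what BSR computes
    return 2u << n; // same value as 1u << (n + 1); n <= 30 under the precondition
}

// Reference: smallest power of two >= x, by repeated doubling (x <= 2^31).
std::uint32_t ceil_pow2_slow(std::uint32_t x) {
    std::uint32_t p = 1u;
    while (p < x)
        p <<= 1;
    return p;
}

// Spot-check the shortcut against the reference.
void check_roundup_2shl() {
    for (std::uint32_t x = 0; x < 70000u; ++x)
        assert(roundup_2shl(x) == ceil_pow2_slow(x));
    for (unsigned k = 1; k < 32; ++k) {
        std::uint32_t p = std::uint32_t{1} << k;
        assert(roundup_2shl(p) == p);
        assert(roundup_2shl((p >> 1) + 1u) == p);
    }
}
```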
Also, Ivybridge has no problem with DEC instead of SUB 1; I don't know why GCC
avoids DEC here but not for Haswell, for example. (For Haswell, GCC pessimizes
by using 32-lzcnt instead of something like lzcnt^31, or still just BSR, which
performs identically on actual Haswell; lzcnt is only faster on AMD.)
But this bug report is just about combining the sub and cmp, not about how to
materialize 1<<(n+1) or other stuff. Better would be:
        sub     edi, 1
        jbe     .L1
        bsr     edi, edi
        xor     eax, eax
        inc     edi
        bts     eax, edi        # EAX |= 1<<EDI
        ret
.L1:
        mov     eax, 1
        ret
Intel SnB-family can macro-fuse sub/jbe. AMD can't, so on AMD the change is
break-even for front-end uops when the branch is not taken, and worse when it
is taken. But it's still smaller code size.
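The sub-first structure above can be expressed at source level by taking the borrow from the single subtraction: after `sub edi, 1`, CF is set exactly when x == 0 and ZF exactly when x == 1, so JBE (CF or ZF) tests x <= 1. A minimal sketch, assuming a 32-bit unsigned int and the GCC/clang __builtin_sub_overflow and __builtin_clz builtins; roundup_sub_first and roundup_reference are hypothetical names, and whether a particular compiler turns this into one sub + jbe is not claimed here:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sub-first bit_ceil for 32-bit values (precondition: x <= 2^31).
std::uint32_t roundup_sub_first(std::uint32_t x) {
    std::uint32_t t;
    // One subtraction: 'borrow' plays the role of CF after "sub edi, 1".
    bool borrow = __builtin_sub_overflow(x, 1u, &t);
    if (borrow || t == 0)   // CF or ZF, i.e. the JBE condition: x == 0 || x == 1
        return 1u;
    unsigned n = 31u - static_cast<unsigned>(__builtin_clz(t)); // "bsr edi, edi"
    return std::uint32_t{1} << (n + 1u);                        // "inc edi / bts eax, edi"
}

// The libstdc++-style form from the report, specialized to 32 bits.
std::uint32_t roundup_reference(std::uint32_t x) {
    if (x == 0 || x == 1)
        return 1u;
    return std::uint32_t{1} << (32 - __builtin_clz(x - 1u));
}

// Both forms should agree wherever the result is representable.
void check_sub_first() {
    for (std::uint32_t x = 0; x < 100000u; ++x)
        assert(roundup_sub_first(x) == roundup_reference(x));
    for (unsigned k = 0; k < 32; ++k) {
        std::uint32_t p = std::uint32_t{1} << k;
        assert(roundup_sub_first(p) == roundup_reference(p));
    }
}
```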
For ARM, clang finds a very clever way to combine it:
roundup(unsigned int):
        subs    r0, r0, #1
        clz     r0, r0
        rsb     r1, r0, #32     @ 32-clz
        mov     r0, #1
        lslhi   r0, r0, r1      @ using flags set by SUBS
        bx      lr              @ 1<<(32-clz) or just 1
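Clang's sequence works because SUBS sets the flags that LSLHI tests: HI means C set and Z clear, and after subtracting 1 the C flag (no borrow) is set when x >= 1 and Z is set when x == 1, so the shift is only applied for x >= 2. ARM's CLZ returns 32 for a zero input, whereas __builtin_clz(0) is undefined, so the model below special-cases zero. A minimal C++ model of the instruction sequence, assuming 32-bit values and x <= 2^31; arm_clz and roundup_arm_model are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

// ARM CLZ semantics: defined for zero (returns 32), unlike __builtin_clz(0).
unsigned arm_clz(std::uint32_t v) {
    return v == 0 ? 32u : static_cast<unsigned>(__builtin_clz(v));
}

// Hypothetical C++ model of clang's branchless ARM sequence (assumes x <= 2^31).
std::uint32_t roundup_arm_model(std::uint32_t x) {
    std::uint32_t r0 = x - 1u;         // subs r0, r0, #1
    bool c = x >= 1u;                  // C flag after SUBS: set when there is no borrow
    bool z = (r0 == 0);                // Z flag: result is zero
    unsigned r1 = 32u - arm_clz(r0);   // clz r0, r0 / rsb r1, r0, #32
    std::uint32_t result = 1u;         // mov r0, #1
    if (c && !z)                       // HI condition (C set, Z clear): x >= 2
        result <<= r1;                 // lslhi r0, r0, r1; r1 <= 31 here
    return result;                     // bx lr
}

void check_arm_model() {
    for (std::uint32_t x = 0; x < 100000u; ++x) {
        std::uint32_t p = 1u;          // reference: smallest power of two >= x
        while (p < x)
            p <<= 1;
        assert(roundup_arm_model(x) == p);
    }
    assert(roundup_arm_model(0x80000000u) == 0x80000000u);
}
```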
GCC, on the other hand, does much worse with -O3 -std=gnu++20 -mcpu=cortex-a53 -mthumb:
roundup(unsigned int):
        cmp     r0, #1
        itttt   hi
        addhi   r3, r0, #-1
        movhi   r0, #1
        clzhi   r3, r3
        rsbhi   r3, r3, #32
        ite     hi
        lslhi   r0, r0, r3
        movls   r0, #1
        bx      lr
I suspect we could do better by combining the cmp and addhi, and doing
`mov r0, #1` outside of predication. (I think that's a separate bug; I'm
planning to report it separately.) Then one `it` would be enough to cover
everything, I think. That would basically reduce it to clang's strategy,
although predicating the clz and rsb is optional.
* [Bug target/104773] compare with 1 not merged with subtract 1
From: crazylht at gmail dot com @ 2022-03-04 1:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773
--- Comment #1 from Hongtao.liu <crazylht at gmail dot com> ---
It looks like the same issue as PR98977.
* [Bug target/104773] compare with 1 not merged with subtract 1
From: rguenth at gcc dot gnu.org @ 2022-03-07 8:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2022-03-07
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. Depending on how the unsigned <= 1 is represented on RTL, this might
or might not be "easily" taught to RTL PRE (RTL CSE is likely too local to
catch it, but at least the fallthru path might form an EBB).
* [Bug rtl-optimization/104773] compare with 1 not merged with subtract 1
From: cptarse-luke at yahoo dot com @ 2023-09-25 10:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773
Luke <cptarse-luke at yahoo dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |cptarse-luke at yahoo dot com
--- Comment #3 from Luke <cptarse-luke at yahoo dot com> ---
*** Bug 111500 has been marked as a duplicate of this bug. ***
* [Bug rtl-optimization/104773] compare with 1 not merged with subtract 1
From: pinskia at gcc dot gnu.org @ 2023-10-25 22:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Created attachment 56314
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56314&action=edit
testcase