public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/104773] New: compare with 1 not merged with subtract 1
@ 2022-03-03 16:02 peter at cordes dot ca
2022-03-04 1:19 ` [Bug target/104773] " crazylht at gmail dot com
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: peter at cordes dot ca @ 2022-03-03 16:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773
Bug ID: 104773
Summary: compare with 1 not merged with subtract 1
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*, arm-*-*
std::bit_ceil(x) involves if(x == 0 || x == 1) return 1;
and 1u << (32-clz(x-1)).
The compare of course compiles to an unsigned <= 1, which can be done with a
sub instead of cmp, producing the value we need as an input for the
leading-zero count. But GCC does *not* do this. (Neither does clang for
x86-64). I trimmed down the libstdc++ <bit> code into something I could
compile even when Godbolt is doesn't have working headers for some ISAs:
https://godbolt.org/z/3EE7W5bna
// cut down from libstdc++ for normal integer cases; compiles the same
template<typename _Tp>
constexpr _Tp
bit_ceil(_Tp __x) noexcept
{
constexpr auto _Nd = std::numeric_limits<_Tp>::digits;
if (__x == 0 || __x == 1)
return 1;
auto __shift_exponent = _Nd - __builtin_clz((_Tp)(__x - 1u));
// using __promoted_type = decltype(__x << 1); ... // removed check for
x<<n widening the result
return (_Tp)1u << __shift_exponent;
}
}
for x86-64 with GCC trunk -O3 -march=ivybridge, we get this inefficient code:
roundup(unsigned int):
mov eax, 1
cmp edi, 1
jbe .L1
sub edi, 1 # could have just done a sub in the first place
bsr edi, edi # correctly avoiding a false dependency by *not*
using ECX as the destination
lea ecx, [rdi+1] # could have shifted 2<<n instead of 1<<(n+1)
sal eax, cl # 3 uops, vs. 1 for bts is a more efficient way
to materialize 1<<n
.L1:
ret
Also, Ivybridge has no problem with DEC instead of SUB 1, IDK why it's avoiding
DEC here but not for Haswell for example. (Haswell pessimizes by using
32-lzcnt instead of lzcnt^31 or something, or still just BSR because it
performs identically on actual Haswell; lzcnt is only faster on AMD)
But this bug report is just about sub/cmp combining, not how to materialize
1<<(n+1) or other stuff: Better would be
sub edi, 1
jbe .L1
bsr edi, edi
xor eax, eax
inc edi
bts eax, edi # EAX |= 1<<EDI
ret
.L1:
mov eax, 1
ret
Intel SnB-family can macro-fuse sub/jbe. AMD can't, so the change is
break-even for front-end uops when the branch is not-taken, and worse when it
is taken. But it's still smaller code-size.
For ARM, clang finds a very clever way to combine it:
roundup(unsigned int):
subs r0, r0, #1
clz r0, r0
rsb r1, r0, #32 @ 32-clz
mov r0, #1
lslhi r0, r0, r1 @ using flags set by SUBS
bx lr @ 1<<(32-clz) or just 1
GCC on the other hand does much worse with -O3 -std=gnu++20 -mcpu=cortex-a53
-mthumb
roundup(unsigned int):
cmp r0, #1
itttt hi
addhi r3, r0, #-1
movhi r0, #1
clzhi r3, r3
rsbhi r3, r3, #32
ite hi
lslhi r0, r0, r3
movls r0, #1
bx lr
I suspect we could do better by combining the cmp and addhi, and doing `mov r0,
#1` outside of predication. (I think that's a separate bug, planning to report
it separately.) Then one `it` would be enough to cover things, I think.
That would basically reduce it to clang's strategy, although the predication of
the clz and rsb are optional.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-10-25 22:05 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-03 16:02 [Bug target/104773] New: compare with 1 not merged with subtract 1 peter at cordes dot ca
2022-03-04 1:19 ` [Bug target/104773] " crazylht at gmail dot com
2022-03-07 8:19 ` rguenth at gcc dot gnu.org
2023-09-25 10:45 ` [Bug rtl-optimization/104773] " cptarse-luke at yahoo dot com
2023-10-25 22:05 ` pinskia at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).