public inbox for gcc-bugs@sourceware.org
* [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
@ 2022-02-21 11:06 pinskia at gcc dot gnu.org
2022-02-21 12:07 ` [Bug target/104611] " wilco at gcc dot gnu.org
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-02-21 11:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
Bug ID: 104611
Summary: memcmp/strcmp/strncmp can be optimized when the result
is tested for [in]equality with 0 on aarch64
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Take:
bool f(char *a)
{
  char t[] = "0123456789012345678901234567890";
  return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
}
Right now GCC expands this using branches, but it could instead be done via a
few loads, an xor (eor) of the two sides, an orr to combine the xor results, a
umaxv reduction, and a final compare against 0. This could also be done in the
middle-end if there were a max-reduction opcode.
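The scalar shape of that idea can be sketched in plain C (an illustration only;
the function name equal32 is invented here, and the real win would come from
doing the same work in vector registers with a umaxv reduction):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch: XOR corresponding 8-byte chunks of the two
   buffers, OR all the differences together, and test the accumulated
   word against zero -- the role a umaxv reduction would play on the
   vector side.  equal32 is a made-up name, not a GCC interface.  */
static int equal32(const char *a, const char *b)
{
    uint64_t acc = 0;
    for (int i = 0; i < 32; i += 8) {
        uint64_t x, y;
        memcpy(&x, a + i, 8);   /* unaligned-safe chunk loads */
        memcpy(&y, b + i, 8);
        acc |= x ^ y;           /* any differing bit survives in acc */
    }
    return acc == 0;
}
```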
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
2022-02-21 11:06 [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64 pinskia at gcc dot gnu.org
@ 2022-02-21 12:07 ` wilco at gcc dot gnu.org
2022-10-19 3:02 ` zhongyunde at huawei dot com
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: wilco at gcc dot gnu.org @ 2022-02-21 12:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
Wilco <wilco at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |wilco at gcc dot gnu.org
--- Comment #1 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #0)
> Take:
>
> bool f(char *a)
> {
>   char t[] = "0123456789012345678901234567890";
>   return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
> }
>
> Right now GCC expands this using branches, but it could instead be done via
> a few loads, an xor (eor) of the two sides, an orr to combine the xor
> results, a umaxv reduction, and a final compare against 0. This could also
> be done in the middle-end if there were a max-reduction opcode.
It's not worth optimizing a small inline memcmp using vector instructions: the
umaxv and the move back to the integer side add extra latency.
However, the expansion could be more efficient and use the same sequence as
GLIBC's memcmp:
ldp data1, data3, [src1, 16]
ldp data2, data4, [src2, 16]
cmp data1, data2
ccmp data3, data4, 0, eq
b.ne L(return2)
Also, the array t[] gets copied onto the stack instead of the string literal
being used directly.
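For illustration, one 16-byte step of that ldp/ccmp sequence can be written in
portable C (a sketch only; equal16 is an invented name, and a good compiler is
expected to turn the two 8-byte loads per side into one ldp and the two
comparisons into cmp + ccmp):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of one 16-byte step of the GLIBC-style sequence: two 8-byte
   loads per side (one ldp each), two comparisons combined into a
   single condition (cmp + ccmp), and one branch or cset at the end. */
static int equal16(const char *a, const char *b)
{
    uint64_t a1, a2, b1, b2;
    memcpy(&a1, a, 8);
    memcpy(&a2, a + 8, 8);      /* ldp data1, data3, [src1] */
    memcpy(&b1, b, 8);
    memcpy(&b2, b + 8, 8);      /* ldp data2, data4, [src2] */
    return (a1 == b1) & (a2 == b2);   /* cmp + ccmp + cset */
}
```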
* [Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
2022-02-21 11:06 [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64 pinskia at gcc dot gnu.org
2022-02-21 12:07 ` [Bug target/104611] " wilco at gcc dot gnu.org
@ 2022-10-19 3:02 ` zhongyunde at huawei dot com
2022-10-30 2:24 ` zhongyunde at huawei dot com
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: zhongyunde at huawei dot com @ 2022-10-19 3:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
vfdff <zhongyunde at huawei dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |zhongyunde at huawei dot com
--- Comment #2 from vfdff <zhongyunde at huawei dot com> ---
Add a runtime case https://gcc.godbolt.org/z/Tv1YP6bPc
* [Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
2022-02-21 11:06 [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64 pinskia at gcc dot gnu.org
2022-02-21 12:07 ` [Bug target/104611] " wilco at gcc dot gnu.org
2022-10-19 3:02 ` zhongyunde at huawei dot com
@ 2022-10-30 2:24 ` zhongyunde at huawei dot com
2023-09-25 10:26 ` redbeard0531 at gmail dot com
2023-09-28 11:35 ` wilco at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: zhongyunde at huawei dot com @ 2022-10-30 2:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
--- Comment #3 from vfdff <zhongyunde at huawei dot com> ---
Since load instructions usually have long latency, does this transformation
need some extra restrictions before we try it?
* [Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
2022-02-21 11:06 [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64 pinskia at gcc dot gnu.org
` (2 preceding siblings ...)
2022-10-30 2:24 ` zhongyunde at huawei dot com
@ 2023-09-25 10:26 ` redbeard0531 at gmail dot com
2023-09-28 11:35 ` wilco at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: redbeard0531 at gmail dot com @ 2023-09-25 10:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
Mathias Stearn <redbeard0531 at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |redbeard0531 at gmail dot com
--- Comment #4 from Mathias Stearn <redbeard0531 at gmail dot com> ---
clang has already been using the optimized memcmp code since v16, even at -O1:
https://www.godbolt.org/z/qEd768TKr. Older versions (at least since v9) were
still branch-free, but via a less optimal sequence of instructions.
GCC's code gets even more ridiculous at 32 bytes, because it does a branch
after every 8-byte compare, while the clang code is fully branch-free (not that
branch-free is always better, but it seems clearly so in this case).
Judging by the codegen, there seem to be three deficiencies in GCC: 1) an
inability to take advantage of the load-pair instructions to load 16 bytes at a
time; 2) an inability to use ccmp to combine comparisons; and 3) use of
branching rather than cset to fill the output register. Ideally these could all
be handled in the general case by the low-level instruction optimizer, but even
getting them special-cased for memcmp (and friends) would be an improvement.
* [Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
2022-02-21 11:06 [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64 pinskia at gcc dot gnu.org
` (3 preceding siblings ...)
2023-09-25 10:26 ` redbeard0531 at gmail dot com
@ 2023-09-28 11:35 ` wilco at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: wilco at gcc dot gnu.org @ 2023-09-28 11:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
Wilco <wilco at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
Last reconfirmed| |2023-09-28
--- Comment #5 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Mathias Stearn from comment #4)
> clang has already been using the optimized memcmp code since v16, even at
> -O1: https://www.godbolt.org/z/qEd768TKr. Older versions (at least since v9)
> were still branch-free, but via a less optimal sequence of instructions.
>
> GCC's code gets even more ridiculous at 32 bytes, because it does a branch
> after every 8-byte compare, while the clang code is fully branch-free (not
> that branch-free is always better, but it seems clearly so in this case).
>
> Judging by the codegen, there seem to be three deficiencies in GCC: 1) an
> inability to take advantage of the load-pair instructions to load 16 bytes
> at a time; 2) an inability to use ccmp to combine comparisons; and 3) use
> of branching rather than cset to fill the output register. Ideally these
> could all be handled in the general case by the low-level instruction
> optimizer, but even getting them special-cased for memcmp (and friends)
> would be an improvement.
I think 1, 2 and 3 are all related due to not having a TImode compare pattern,
so GCC splits things into 8-byte chunks using branches. We could add that and
see whether the result is better or add a backend expander for memcmp similar
to memset and memcpy.
Note that what LLVM does is terrible: a 64-byte memcmp is ridiculously
inefficient due to long dependency chains, loading and comparing every byte
even when there is a mismatch in byte 0. So it's obviously better to use
branches.
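The branching strategy argued for here can be sketched as an early-exit loop
over 16-byte chunks (illustrative only; equal_n is an invented name, and n is
assumed to be a multiple of 16):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Early-exit sketch: compare 16 bytes at a time and branch out on the
   first mismatching pair, so a difference in byte 0 does not pay the
   latency of loading and comparing all n bytes. */
static int equal_n(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        uint64_t a1, a2, b1, b2;
        memcpy(&a1, a + i, 8);
        memcpy(&a2, a + i + 8, 8);
        memcpy(&b1, b + i, 8);
        memcpy(&b2, b + i + 8, 8);
        if ((a1 != b1) | (a2 != b2))   /* one branch per 16 bytes */
            return 0;
    }
    return 1;
}
```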