public inbox for gcc-bugs@sourceware.org
* [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
@ 2022-02-21 11:06 pinskia at gcc dot gnu.org
2022-02-21 12:07 ` [Bug target/104611] " wilco at gcc dot gnu.org
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-02-21 11:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
Bug ID: 104611
Summary: memcmp/strcmp/strncmp can be optimized when the result
is tested for [in]equality with 0 on aarch64
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Take:
bool f(char *a)
{
  char t[] = "0123456789012345678901234567890";
  return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
}
Right now GCC expands this using branches, but it could instead be done via a
few loads, an xor (eor) of the two sides, an orr to combine the xor results, a
umaxv reduction, and a final compare against 0. This could also be done in the
middle-end if there were a max-reduction opcode.
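The scalar shape of that idea can be sketched in plain C (an illustration only;
the function name equal32 is invented here, and the real win would come from
doing the same work in vector registers with a umaxv reduction):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch: XOR corresponding 8-byte chunks of the two
   buffers, OR all the differences together, and test the accumulated
   word against zero -- the role a umaxv reduction would play on the
   vector side.  equal32 is a made-up name, not a GCC interface.  */
static int equal32(const char *a, const char *b)
{
    uint64_t acc = 0;
    for (int i = 0; i < 32; i += 8) {
        uint64_t x, y;
        memcpy(&x, a + i, 8);   /* unaligned-safe chunk loads */
        memcpy(&y, b + i, 8);
        acc |= x ^ y;           /* any differing bit survives in acc */
    }
    return acc == 0;
}
```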
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
2022-02-21 11:06 [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64 pinskia at gcc dot gnu.org
@ 2022-02-21 12:07 ` wilco at gcc dot gnu.org
2022-10-19 3:02 ` zhongyunde at huawei dot com
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: wilco at gcc dot gnu.org @ 2022-02-21 12:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
Wilco <wilco at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |wilco at gcc dot gnu.org
--- Comment #1 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #0)
> Take:
>
> bool f(char *a)
> {
>   char t[] = "0123456789012345678901234567890";
>   return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
> }
>
> Right now GCC expands this using branches, but it could instead be done via
> a few loads, an xor (eor) of the two sides, an orr to combine the xor
> results, a umaxv reduction, and a final compare against 0. This could also
> be done in the middle-end if there were a max-reduction opcode.
It's not worth optimizing a small inline memcmp using vector instructions: the
umaxv and the move back to the integer side add extra latency.
However, the expansion could be more efficient and use the same sequence as
GLIBC's memcmp:
ldp data1, data3, [src1, 16]
ldp data2, data4, [src2, 16]
cmp data1, data2
ccmp data3, data4, 0, eq
b.ne L(return2)
Also, the array t[] gets copied onto the stack instead of the string literal
being used directly.
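For illustration, one 16-byte step of that ldp/ccmp sequence can be written in
portable C (a sketch only; equal16 is an invented name, and a good compiler is
expected to turn the two 8-byte loads per side into one ldp and the two
comparisons into cmp + ccmp):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of one 16-byte step of the GLIBC-style sequence: two 8-byte
   loads per side (one ldp each), two comparisons combined into a
   single condition (cmp + ccmp), and one branch or cset at the end. */
static int equal16(const char *a, const char *b)
{
    uint64_t a1, a2, b1, b2;
    memcpy(&a1, a, 8);
    memcpy(&a2, a + 8, 8);      /* ldp data1, data3, [src1] */
    memcpy(&b1, b, 8);
    memcpy(&b2, b + 8, 8);      /* ldp data2, data4, [src2] */
    return (a1 == b1) & (a2 == b2);   /* cmp + ccmp + cset */
}
```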
* [Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
2022-02-21 11:06 [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64 pinskia at gcc dot gnu.org
2022-02-21 12:07 ` [Bug target/104611] " wilco at gcc dot gnu.org
@ 2022-10-19 3:02 ` zhongyunde at huawei dot com
2022-10-30 2:24 ` zhongyunde at huawei dot com
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: zhongyunde at huawei dot com @ 2022-10-19 3:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
vfdff <zhongyunde at huawei dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |zhongyunde at huawei dot com
--- Comment #2 from vfdff <zhongyunde at huawei dot com> ---
Add a runtime case https://gcc.godbolt.org/z/Tv1YP6bPc
* [Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
2022-02-21 11:06 [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64 pinskia at gcc dot gnu.org
2022-02-21 12:07 ` [Bug target/104611] " wilco at gcc dot gnu.org
2022-10-19 3:02 ` zhongyunde at huawei dot com
@ 2022-10-30 2:24 ` zhongyunde at huawei dot com
2023-09-25 10:26 ` redbeard0531 at gmail dot com
2023-09-28 11:35 ` wilco at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: zhongyunde at huawei dot com @ 2022-10-30 2:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
--- Comment #3 from vfdff <zhongyunde at huawei dot com> ---
Since load instructions usually have long latency, does this transformation
need some extra restrictions before we try it?
* [Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
2022-02-21 11:06 [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64 pinskia at gcc dot gnu.org
` (2 preceding siblings ...)
2022-10-30 2:24 ` zhongyunde at huawei dot com
@ 2023-09-25 10:26 ` redbeard0531 at gmail dot com
2023-09-28 11:35 ` wilco at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: redbeard0531 at gmail dot com @ 2023-09-25 10:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
Mathias Stearn <redbeard0531 at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |redbeard0531 at gmail dot com
--- Comment #4 from Mathias Stearn <redbeard0531 at gmail dot com> ---
clang has already been using the optimized memcmp code since v16, even at -O1:
https://www.godbolt.org/z/qEd768TKr. Older versions (at least since v9) were
still branch-free, but via a less optimal sequence of instructions.
GCC's code gets even more ridiculous at 32 bytes, because it does a branch
after every 8-byte compare, while the clang code is fully branch-free (not that
branch-free is always better, but it seems clearly so in this case).
Judging by the codegen, there seem to be three deficiencies in GCC: 1) an
inability to take advantage of the load-pair instructions to load 16 bytes at a
time; 2) an inability to use ccmp to combine comparisons; and 3) use of
branching rather than cset to fill the output register. Ideally these could all
be handled in the general case by the low-level instruction optimizer, but even
getting them special-cased for memcmp (and friends) would be an improvement.
* [Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
2022-02-21 11:06 [Bug target/104611] New: memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64 pinskia at gcc dot gnu.org
` (3 preceding siblings ...)
2023-09-25 10:26 ` redbeard0531 at gmail dot com
@ 2023-09-28 11:35 ` wilco at gcc dot gnu.org
4 siblings, 0 replies; 6+ messages in thread
From: wilco at gcc dot gnu.org @ 2023-09-28 11:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611
Wilco <wilco at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
Last reconfirmed| |2023-09-28
--- Comment #5 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Mathias Stearn from comment #4)
> clang has already been using the optimized memcmp code since v16, even at
> -O1: https://www.godbolt.org/z/qEd768TKr. Older versions (at least since v9)
> were still branch-free, but via a less optimal sequence of instructions.
>
> GCC's code gets even more ridiculous at 32 bytes, because it does a branch
> after every 8-byte compare, while the clang code is fully branch-free (not
> that branch-free is always better, but it seems clearly so in this case).
>
> Judging by the codegen, there seem to be three deficiencies in GCC: 1) an
> inability to take advantage of the load-pair instructions to load 16 bytes
> at a time; 2) an inability to use ccmp to combine comparisons; and 3) use
> of branching rather than cset to fill the output register. Ideally these
> could all be handled in the general case by the low-level instruction
> optimizer, but even getting them special-cased for memcmp (and friends)
> would be an improvement.
I think 1, 2 and 3 are all related due to not having a TImode compare pattern,
so GCC splits things into 8-byte chunks using branches. We could add that and
see whether the result is better or add a backend expander for memcmp similar
to memset and memcpy.
Note that what LLVM does is terrible: a 64-byte memcmp is ridiculously
inefficient due to long dependency chains, loading and comparing every byte
even when there is a mismatch in byte 0. So it's obviously better to use
branches.
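The branching strategy argued for here can be sketched as an early-exit loop
over 16-byte chunks (illustrative only; equal_n is an invented name, and n is
assumed to be a multiple of 16):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Early-exit sketch: compare 16 bytes at a time and branch out on the
   first mismatching pair, so a difference in byte 0 does not pay the
   latency of loading and comparing all n bytes. */
static int equal_n(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        uint64_t a1, a2, b1, b2;
        memcpy(&a1, a + i, 8);
        memcpy(&a2, a + i + 8, 8);
        memcpy(&b1, b + i, 8);
        memcpy(&b2, b + i + 8, 8);
        if ((a1 != b1) | (a2 != b2))   /* one branch per 16 bytes */
            return 0;
    }
    return 1;
}
```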