The following simple test case, from PR 104610, shows that memcmp () == 0
can result in some bizarre code sequences on x86.

int foo(char *a)
{
    static const char t[] = "0123456789012345678901234567890";
    return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
}

with -O2 currently contains both:
        xorl    %eax, %eax
        xorl    $1, %eax
and also
        movl    $1, %eax
        xorl    $1, %eax

Changing the return type of foo to _Bool results in the equally
bizarre:
        xorl    %eax, %eax
        testl   %eax, %eax
        sete    %al
and also
        movl    $1, %eax
        testl   %eax, %eax
        sete    %al

All these sequences set the result to a constant, but this optimization
opportunity only occurs very late during compilation, by basic block
duplication in the 322r.bbro pass, too late for CSE or peephole2 to
do anything about it.  The problem is that the idiom expanded by
compare_by_pieces for __builtin_memcmp_eq contains basic blocks that
can't easily be optimized by if-conversion due to the multiple
incoming edges on the fail block.

In summary, compare_by_pieces generates code that looks like:

        if (x[0] != y[0]) goto fail_label;
        if (x[1] != y[1]) goto fail_label;
        ...
        if (x[n] != y[n]) goto fail_label;
        result = 1;
        goto end_label;
fail_label:
        result = 0;
end_label:

In theory, the RTL if-conversion pass could be enhanced to tackle
arbitrarily complex if-then-else graphs, but the solution proposed
here is to allow suitable targets to perform if-conversion during
compare_by_pieces.  The x86, for example, can take advantage that
all of the above comparisons set and test the zero flag (ZF), which
can then be used in combination with sete.  Hence compare_by_pieces
could instead generate:

        if (x[0] != y[0]) goto fail_label;
        if (x[1] != y[1]) goto fail_label;
        ...
        if (x[n] != y[n]) goto fail_label;
fail_label:
        sete result

which requires one less basic block, and the redundant conditional
branch to a label immediately after is cleaned up by GCC's existing
RTL optimizations.

For the test case above, where -O2 -msse4 previously generated:

foo:    movdqu  (%rdi), %xmm0
        pxor    .LC0(%rip), %xmm0
        ptest   %xmm0, %xmm0
        je      .L5
.L2:    movl    $1, %eax
        xorl    $1, %eax
        ret
.L5:    movdqu  16(%rdi), %xmm0
        pxor    .LC1(%rip), %xmm0
        ptest   %xmm0, %xmm0
        jne     .L2
        xorl    %eax, %eax
        xorl    $1, %eax
        ret

we now generate:

foo:    movdqu  (%rdi), %xmm0
        pxor    .LC0(%rip), %xmm0
        ptest   %xmm0, %xmm0
        jne     .L2
        movdqu  16(%rdi), %xmm0
        pxor    .LC1(%rip), %xmm0
        ptest   %xmm0, %xmm0
.L2:    sete    %al
        movzbl  %al, %eax
        ret

Using a target hook allows the large amount of intelligence already in
compare_by_pieces to be re-used by the i386 backend, but this can also
help other backends with condition flags where the equality result can
be materialized.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-06-12  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386.cc (ix86_finish_compare_by_pieces): New
        function to provide a backend specific implementation.
        (TARGET_FINISH_COMPARE_BY_PIECES): Use the above function.

        * doc/tm.texi.in (TARGET_FINISH_COMPARE_BY_PIECES): New @hook.
        * doc/tm.texi: Regenerate.

        * expr.cc (compare_by_pieces): Call finish_compare_by_pieces in
        targetm to finalize the RTL expansion.  Move the current
        implementation to a default target hook.
        * target.def (finish_compare_by_pieces): New target hook to allow
        compare_by_pieces to be customized by the target.
        * targhooks.cc (default_finish_compare_by_pieces): Default
        implementation moved here from expr.cc's compare_by_pieces.
        * targhooks.h (default_finish_compare_by_pieces): Prototype.

gcc/testsuite/ChangeLog
        * gcc.target/i386/pieces-memcmp-1.c: New test case.


Thanks in advance,
Roger
--