From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=t2Da=CO=arm.com=richard.sandiford@sourceware.org>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id 2D32E3858D20
	for <gcc-patches@gcc.gnu.org>; Mon, 26 Jun 2023 18:24:28 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 2D32E3858D20
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id DA6212F4;
	Mon, 26 Jun 2023 11:25:11 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 064013F64C;
	Mon, 26 Jun 2023 11:24:26 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
Mail-Followup-To: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>,Roger Sayle <roger@nextmovesoftware.com>,  Richard Biener <richard.guenther@gmail.com>,  Uros Bizjak <ubizjak@gmail.com>,  Jakub Jelinek <jakub@redhat.com>, richard.sandiford@arm.com
Cc: Roger Sayle <roger@nextmovesoftware.com>,  Richard Biener <richard.guenther@gmail.com>,  Uros Bizjak <ubizjak@gmail.com>,  Jakub Jelinek <jakub@redhat.com>
Subject: Re: [PATCH] New finish_compare_by_pieces target hook (for x86).
References: <001001d99d36$a38111c0$ea833540$@nextmovesoftware.com>
	<CAFiYyc1Jo8CTT3_BOQ9gq_yfH4brZwkB3oS_QV4QWC568GDbWg@mail.gmail.com>
	<001e01d9a727$6375adc0$2a610940$@nextmovesoftware.com>
	<CAFiYyc2Z0e0z-mOo7n0rmaqQ85toSqdn1pFKXh-sE+pGC6UBrQ@mail.gmail.com>
Date: Mon, 26 Jun 2023 19:24:25 +0100
In-Reply-To: <CAFiYyc2Z0e0z-mOo7n0rmaqQ85toSqdn1pFKXh-sE+pGC6UBrQ@mail.gmail.com>
	(Richard Biener via Gcc-patches's message of "Mon, 26 Jun 2023
	09:18:00 +0200")
Message-ID: <mptjzvqcc9y.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-21.4 required=5.0 tests=BAYES_00,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Sun, Jun 25, 2023 at 7:39=E2=80=AFAM Roger Sayle <roger@nextmovesoftwa=
re.com> wrote:
>>
>>
>> On Tue, 13 June 2023 12:02, Richard Biener wrote:
>> > On Mon, Jun 12, 2023 at 4:04=E2=80=AFPM Roger Sayle <roger@nextmovesof=
tware.com>
>> > wrote:
>> > > The following simple test case, from PR 104610, shows that memcmp ()
>> > > =3D=3D 0 can result in some bizarre code sequences on x86.
>> > >
>> > > int foo(char *a)
>> > > {
>> > >     static const char t[] =3D "0123456789012345678901234567890";
>> > >     return __builtin_memcmp(a, &t[0], sizeof(t)) =3D=3D 0; }
>> > >
>> > > with -O2 currently contains both:
>> > >         xorl    %eax, %eax
>> > >         xorl    $1, %eax
>> > > and also
>> > >         movl    $1, %eax
>> > >         xorl    $1, %eax
>> > >
>> > > Changing the return type of foo to _Bool results in the equally
>> > > bizarre:
>> > >         xorl    %eax, %eax
>> > >         testl   %eax, %eax
>> > >         sete    %al
>> > > and also
>> > >         movl    $1, %eax
>> > >         testl   %eax, %eax
>> > >         sete    %al
>> > >
>> > > All these sequences set the result to a constant, but this
>> > > optimization opportunity only occurs very late during compilation, by
>> > > basic block duplication in the 322r.bbro pass, too late for CSE or
>> > > peephole2 to do anything about it.  The problem is that the idiom
>> > > expanded by compare_by_pieces for __builtin_memcmp_eq contains basic
>> > > blocks that can't easily be optimized by if-conversion due to the
>> > > multiple incoming edges on the fail block.
>> > >
>> > > In summary, compare_by_pieces generates code that looks like:
>> > >
>> > >         if (x[0] !=3D y[0]) goto fail_label;
>> > >         if (x[1] !=3D y[1]) goto fail_label;
>> > >         ...
>> > >         if (x[n] !=3D y[n]) goto fail_label;
>> > >         result =3D 1;
>> > >         goto end_label;
>> > > fail_label:
>> > >         result =3D 0;
>> > > end_label:
>> > >
>> > > In theory, the RTL if-conversion pass could be enhanced to tackle
>> > > arbitrarily complex if-then-else graphs, but the solution proposed
>> > > here is to allow suitable targets to perform if-conversion during
>> > > compare_by_pieces.  The x86, for example, can take advantage that all
>> > > of the above comparisons set and test the zero flag (ZF), which can
>> > > then be used in combination with sete.  Hence compare_by_pieces could
>> > > instead generate:
>> > >
>> > >         if (x[0] !=3D y[0]) goto fail_label;
>> > >         if (x[1] !=3D y[1]) goto fail_label;
>> > >         ...
>> > >         if (x[n] !=3D y[n]) goto fail_label;
>> > > fail_label:
>> > >         sete result
>> > >
>> > > which requires one less basic block, and the redundant conditional
>> > > branch to a label immediately after is cleaned up by GCC's existing
>> > > RTL optimizations.
>> > >
>> > > For the test case above, where -O2 -msse4 previously generated:
>> > >
>> > > foo:    movdqu  (%rdi), %xmm0
>> > >         pxor    .LC0(%rip), %xmm0
>> > >         ptest   %xmm0, %xmm0
>> > >         je      .L5
>> > > .L2:    movl    $1, %eax
>> > >         xorl    $1, %eax
>> > >         ret
>> > > .L5:    movdqu  16(%rdi), %xmm0
>> > >         pxor    .LC1(%rip), %xmm0
>> > >         ptest   %xmm0, %xmm0
>> > >         jne     .L2
>> > >         xorl    %eax, %eax
>> > >         xorl    $1, %eax
>> > >         ret
>> > >
>> > > we now generate:
>> > >
>> > > foo:    movdqu  (%rdi), %xmm0
>> > >         pxor    .LC0(%rip), %xmm0
>> > >         ptest   %xmm0, %xmm0
>> > >         jne     .L2
>> > >         movdqu  16(%rdi), %xmm0
>> > >         pxor    .LC1(%rip), %xmm0
>> > >         ptest   %xmm0, %xmm0
>> > > .L2:    sete    %al
>> > >         movzbl  %al, %eax
>> > >         ret
>> > >
>> > > Using a target hook allows the large amount of intelligence already =
in
>> > > compare_by_pieces to be re-used by the i386 backend, but this can al=
so
>> > > help other backends with condition flags where the equality result c=
an
>> > > be materialized.
>> > >
>> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
>> > > and make -k check, both with and without --target_board=3Dunix{-m32}
>> > > with no new failures.  Ok for mainline?
>> >
>> > What's the guarantee that the zero flag is appropriately set on all ed=
ges incoming
>> > now and forever?
>>
>> Is there any reason why this target hook can't be removed (in future) sh=
ould it stop
>> being useful?  It's completely optional and not required for the correct=
 functioning
>> of the compiler.
>>
>> > Does this require target specific knowledge on how do_compare_rtx_and_=
jump
>> > is emitting RTL?
>>
>> Yes.  Each backend can decide how best to implement finish_compare_by_pi=
eces
>> given its internal knowledge of how do_compare_rtx_and_jump works.  It's=
 not
>> important to the middle-end how the underlying invariants are guaranteed=
, just
>> that they are and the backend produces correct code.  A backend may stor=
e flags
>> on the target label, or maintain state in cfun.  Future changes to the i=
386 backend
>> might cause it to revert to the default finish_compare_by_pieces, or pro=
vide an
>> alternate implementation, but at the moment this patch improves the code=
 that
>> GCC generates.  Very little (in software like GCC) is forever.
>>
>> > Do you see matching this in ifcvt to be unreasonable?  I'm thinking of=
 "reducing"
>> > the incoming edges pairwise without actually looking at the ifcvt code.
>>
>> There's nothing about the proposed patch that prevents or blocks improve=
ments
>> to ifcvt, which I agree completely could be improved.  But even (in futu=
re) if later
>> RTL passes could clean things up, that's no reason for RTL expansion to =
initially
>> generate poor/inefficient code.  I'm not sure that a (hypothetical) ifcv=
t improvement
>> would be sufficient reason to revert/remove enhancements to compare_by_p=
ieces.
>>
>> Is it that there's not enough (bootstrap and) testsuite coverage of comp=
are_by_pieces
>> to make you feel comfortable with this change?  The proposed implementat=
ion should
>> be "obvious enough" to future generations what the intended behaviour sh=
ould be.
>> And the x86 target hook implementation (i.e. the change) is only four li=
nes long, a
>> fraction of the size of the new documentation and comments.
>
> My main concern was that we are communicating "implicit" dependences betw=
een
> the target hook and expand RTL code generation - we don't seem to pass do=
wn
> enough info for example to have the target verify constraints and
> excuse itself if
> they do not hold.  They also seem to be poorly documented and the
> compare_by_pieces (and all _by_pieces) stuff has been extended to be quite
> "configurable" and thus is probably prone to emit vastly different RTL in
> some special cases?

FWIW, I share these concerns, especially about the target hook not
being able to verify its assumptions.  Maybe it wouldn't matter too
much if the worst outcome was a missed optimisation, but I don't think
the mechanism is reliable enough for us to rely on it for correctness.

Richard