From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from server.nextmovesoftware.com (server.nextmovesoftware.com [162.254.253.69]) by sourceware.org (Postfix) with ESMTPS id 5B5613858D35 for ; Sun, 25 Jun 2023 05:39:20 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 5B5613858D35 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=nextmovesoftware.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=nextmovesoftware.com DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=nextmovesoftware.com; s=default; h=Content-Transfer-Encoding:Content-Type: MIME-Version:Message-ID:Date:Subject:In-Reply-To:References:Cc:To:From:Sender :Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help: List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=eQ4odnln8wp9iPE8EHSERn5U2Z4mfct/J4SOMFzlRWA=; b=Q2i1/MrL5J6ezhkQbxZoP7kX8x nwtTY/hcn5eLjCwtO26DqcxRbXs3N7+c8jQKPwS+nxDXeCE8r8dZ0rx+Hv2O8fKE2WvNPwIGI9AOE 7kCEGtM2C7esuhDC8Ry6FoLD05g8eEB7pnvZHRqvdBBGfOCYp/0/JvYxFY2uNuZvUD2w+HsV9ruyA pgOh+kurgetns3WTAQ5hlUM3MBOGc2vjpRqg+cgopgCfJ1OfH6sUdewnNQSK+Uh1r6+8IlG93CYfC NtSDyRK0n+B34RhHFjSgR6xpaPwaQcFmjPbhKuqd+JN5OveL3bC1+NZ3w65aR0j7NdgiXXmpz+Grx krxK4uWQ==; Received: from host86-161-68-50.range86-161.btcentralplus.com ([86.161.68.50]:53064 helo=Dell) by server.nextmovesoftware.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1qDIT1-0006TE-1I; Sun, 25 Jun 2023 01:39:19 -0400 From: "Roger Sayle" To: "'Richard Biener'" Cc: , "'Uros Bizjak'" , "'Jakub Jelinek'" References: <001001d99d36$a38111c0$ea833540$@nextmovesoftware.com> In-Reply-To: Subject: RE: [PATCH] New finish_compare_by_pieces target hook (for x86). Date: Sun, 25 Jun 2023 06:39:17 +0100 Message-ID: <001e01d9a727$6375adc0$2a610940$@nextmovesoftware.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-Mailer: Microsoft Outlook 16.0 Thread-Index: AQHjUiizPUa+8vvkpBMWVKeUxxlQbQHdnoB0r3iw7nA= Content-Language: en-gb X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - server.nextmovesoftware.com X-AntiAbuse: Original Domain - gcc.gnu.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - nextmovesoftware.com X-Get-Message-Sender-Via: server.nextmovesoftware.com: authenticated_id: roger@nextmovesoftware.com X-Authenticated-Sender: server.nextmovesoftware.com: roger@nextmovesoftware.com X-Source: X-Source-Args: X-Source-Dir: X-Spam-Status: No, score=-3.4 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,KAM_SHORT,RCVD_IN_BARRACUDACENTRAL,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Tue, 13 June 2023 12:02, Richard Biener wrote: > On Mon, Jun 12, 2023 at 4:04=E2=80=AFPM Roger Sayle = > wrote: > > The following simple test case, from PR 104610, shows that memcmp () > > =3D=3D 0 can result in some bizarre code sequences on x86. > > > > int foo(char *a) > > { > > static const char t[] =3D "0123456789012345678901234567890"; > > return __builtin_memcmp(a, &t[0], sizeof(t)) =3D=3D 0; } > > > > with -O2 currently contains both: > > xorl %eax, %eax > > xorl $1, %eax > > and also > > movl $1, %eax > > xorl $1, %eax > > > > Changing the return type of foo to _Bool results in the equally > > bizarre: > > xorl %eax, %eax > > testl %eax, %eax > > sete %al > > and also > > movl $1, %eax > > testl %eax, %eax > > sete %al > > > > All these sequences set the result to a constant, but this > > optimization opportunity only occurs very late during compilation, = by > > basic block duplication in the 322r.bbro pass, too late for CSE or > > peephole2 to do anything about it. The problem is that the idiom > > expanded by compare_by_pieces for __builtin_memcmp_eq contains basic > > blocks that can't easily be optimized by if-conversion due to the > > multiple incoming edges on the fail block. > > > > In summary, compare_by_pieces generates code that looks like: > > > > if (x[0] !=3D y[0]) goto fail_label; > > if (x[1] !=3D y[1]) goto fail_label; > > ... > > if (x[n] !=3D y[n]) goto fail_label; > > result =3D 1; > > goto end_label; > > fail_label: > > result =3D 0; > > end_label: > > > > In theory, the RTL if-conversion pass could be enhanced to tackle > > arbitrarily complex if-then-else graphs, but the solution proposed > > here is to allow suitable targets to perform if-conversion during > > compare_by_pieces. The x86, for example, can take advantage that = all > > of the above comparisons set and test the zero flag (ZF), which can > > then be used in combination with sete. Hence compare_by_pieces = could > > instead generate: > > > > if (x[0] !=3D y[0]) goto fail_label; > > if (x[1] !=3D y[1]) goto fail_label; > > ... > > if (x[n] !=3D y[n]) goto fail_label; > > fail_label: > > sete result > > > > which requires one less basic block, and the redundant conditional > > branch to a label immediately after is cleaned up by GCC's existing > > RTL optimizations. > > > > For the test case above, where -O2 -msse4 previously generated: > > > > foo: movdqu (%rdi), %xmm0 > > pxor .LC0(%rip), %xmm0 > > ptest %xmm0, %xmm0 > > je .L5 > > .L2: movl $1, %eax > > xorl $1, %eax > > ret > > .L5: movdqu 16(%rdi), %xmm0 > > pxor .LC1(%rip), %xmm0 > > ptest %xmm0, %xmm0 > > jne .L2 > > xorl %eax, %eax > > xorl $1, %eax > > ret > > > > we now generate: > > > > foo: movdqu (%rdi), %xmm0 > > pxor .LC0(%rip), %xmm0 > > ptest %xmm0, %xmm0 > > jne .L2 > > movdqu 16(%rdi), %xmm0 > > pxor .LC1(%rip), %xmm0 > > ptest %xmm0, %xmm0 > > .L2: sete %al > > movzbl %al, %eax > > ret > > > > Using a target hook allows the large amount of intelligence already = in > > compare_by_pieces to be re-used by the i386 backend, but this can = also > > help other backends with condition flags where the equality result = can > > be materialized. > > > > This patch has been tested on x86_64-pc-linux-gnu with make = bootstrap > > and make -k check, both with and without --target_board=3Dunix{-m32} > > with no new failures. Ok for mainline? >=20 > What's the guarantee that the zero flag is appropriately set on all = edges incoming > now and forever? Is there any reason why this target hook can't be removed (in future) = should it stop being useful? It's completely optional and not required for the correct = functioning of the compiler. > Does this require target specific knowledge on how = do_compare_rtx_and_jump > is emitting RTL? Yes. Each backend can decide how best to implement = finish_compare_by_pieces given its internal knowledge of how do_compare_rtx_and_jump works. It's = not important to the middle-end how the underlying invariants are = guaranteed, just that they are and the backend produces correct code. A backend may = store flags on the target label, or maintain state in cfun. Future changes to the = i386 backend might cause it to revert to the default finish_compare_by_pieces, or = provide an alternate implementation, but at the moment this patch improves the code = that GCC generates. Very little (in software like GCC) is forever. > Do you see matching this in ifcvt to be unreasonable? I'm thinking of = "reducing" > the incoming edges pairwise without actually looking at the ifcvt = code. There's nothing about the proposed patch that prevents or blocks = improvements to ifcvt, which I agree completely could be improved. But even (in = future) if later RTL passes could clean things up, that's no reason for RTL expansion to = initially generate poor/inefficient code. I'm not sure that a (hypothetical) = ifcvt improvement would be sufficient reason to revert/remove enhancements to = compare_by_pieces. Is it that there's not enough (bootstrap and) testsuite coverage of = compare_by_pieces to make you feel comfortable with this change? The proposed = implementation should be "obvious enough" to future generations what the intended behaviour = should be. And the x86 target hook implementation (i.e. the change) is only four = lines long, a fraction of the size of the new documentation and comments. Thanks in advance. Roger > > 2023-06-12 Roger Sayle > > > > gcc/ChangeLog > > * config/i386/i386.cc (ix86_finish_compare_by_pieces): New > > function to provide a backend specific implementation. > > (TARGET_FINISH_COMPARE_BY_PIECES): Use the above function. > > > > * doc/tm.texi.in (TARGET_FINISH_COMPARE_BY_PIECES): New = @hook. > > * doc/tm.texi: Regenerate. > > > > * expr.cc (compare_by_pieces): Call finish_compare_by_pieces = in > > targetm to finalize the RTL expansion. Move the current > > implementation to a default target hook. > > * target.def (finish_compare_by_pieces): New target hook to = allow > > compare_by_pieces to be customized by the target. > > * targhooks.cc (default_finish_compare_by_pieces): Default > > implementation moved here from expr.cc's compare_by_pieces. > > * targhooks.h (default_finish_compare_by_pieces): Prototype. > > > > gcc/testsuite/ChangeLog > > * gcc.target/i386/pieces-memcmp-1.c: New test case. > >