From: "Roger Sayle"
To: "'H.J. Lu'"
Cc: "'Uros Bizjak'", "'GCC Patches'"
Subject: RE: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.
Date: Thu, 14 Jul 2022 06:31:44 +0100
Message-ID: <000c01d89743$0278e4f0$076aaed0$@nextmovesoftware.com>

On Mon, Jul 11, 2022, H.J. Lu wrote:
> On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle wrote:
> > Hi HJ,
> >
> > I believe this should now be handled by the post-reload (CSE) pass.
> > Consider the simple test case:
> >
> > __int128 a, b, c;
> > void foo()
> > {
> >   a = 0;
> >   b = 0;
> >   c = 0;
> > }
> >
> > Without any STV, i.e.
> > -O2 -msse4 -mno-stv, GCC generates TImode writes:
> >         movq    $0, a(%rip)
> >         movq    $0, a+8(%rip)
> >         movq    $0, b(%rip)
> >         movq    $0, b+8(%rip)
> >         movq    $0, c(%rip)
> >         movq    $0, c+8(%rip)
> >         ret
> >
> > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         movaps  %xmm0, b(%rip)
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > You're quite right that internally STV actually generates the equivalent of:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, b(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > And currently, because STV runs before cse2 and combine, the const0_rtx
> > gets CSE'd by the cse2 pass to produce the code we see.  However, if
> > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass),
> > you'll see we continue to generate the same optimized code, as the
> > same const0_rtx gets CSE'd in postreload.
> >
> > I can't be certain until I try the experiment, but I believe that the
> > postreload CSE will clean up all of the same common subexpressions.
> > Hence, it should be safe to perform all STV at the same point (after
> > combine), which allows for a few additional optimizations.
> >
> > Does this make sense?  Do you have a test case where
> > -fno-rerun-cse-after-loop produces different/inferior code for TImode STV
> > chains?
> >
> > My guess is that the RTL passes have changed so much in the last six
> > or seven years that some of the original motivation no longer applies.
> > Certainly we now try to keep TI mode operations visible longer, and
> > then allow STV to behave like a pre-reload pass to decide which set of
> > registers to use (vector V1TI or scalar doubleword DI).  Any CSE
> > opportunities that cse2 finds with V1TI mode could/should equally
> > well be found for TI mode (mostly).
>
> You are probably right.  If there are no regressions in GCC testsuite, my original
> motivation is no longer valid.

It was good to try the experiment, but H.J. is right: there is still some
benefit (as well as some disadvantages) to running STV lowering before
CSE2/combine.  A clean-up patch to perform all STV conversion as a single
pass (removing a pass from the compiler) results in just a single
regression in the test suite:

FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8

which looks like:

__int128 a, b, c, d, e, f;
void foo (void)
{
  a = 0;
  b = -1;
  c = 0;
  d = -1;
  e = 0;
  f = -1;
}

By performing STV after combine (without CSE), reload prefers to implement
this function using a single register, which then requires 12 instructions
rather than 8 (if using two registers).  Alas, there's nothing that
postreload CSE/GCSE can do.  Doh!

        pxor    %xmm0, %xmm0
        movaps  %xmm0, a(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, b(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, c(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, d(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, e(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, f(%rip)
        ret

I also note that even without STV, the scalar implementation of this
function when compiled with -Os is also larger than it needs to be due to
poor CSE (notice in the following that we only need a single zero register,
and an all_ones reg would be helpful).

        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ecx, %ecx
        movq    $-1, b(%rip)
        movq    %rax, a(%rip)
        movq    %rax, a+8(%rip)
        movq    $-1, b+8(%rip)
        movq    %rdx, c(%rip)
        movq    %rdx, c+8(%rip)
        movq    $-1, d(%rip)
        movq    $-1, d+8(%rip)
        movq    %rcx, e(%rip)
        movq    %rcx, e+8(%rip)
        movq    $-1, f(%rip)
        movq    $-1, f+8(%rip)
        ret

I need to give the problem some more thought.  It would be good to
clean-up/unify the STV passes, but I/we need to solve/CSE HJ's last test
case before we do.  Perhaps forbidding "(set (mem:ti) (const_int 0))" in
movti_internal would force the zero register to become visible and CSE'd,
benefiting both vector code and scalar -Os code; postreload/peephole2
could then fix up the remaining scalar cases.
It's tricky.

Cheers,
Roger
--