From: Richard Biener
Date: Thu, 14 Jul 2022 09:10:28 +0200
Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.
To: Roger Sayle
Cc: "H.J. Lu", GCC Patches
References: <000901d8938d$ead4dc40$c07e94c0$@nextmovesoftware.com>
 <00f201d8948b$ec82a6e0$c587f4a0$@nextmovesoftware.com>
 <014a01d894a5$71189220$5349b660$@nextmovesoftware.com>
 <000c01d89743$0278e4f0$076aaed0$@nextmovesoftware.com>
In-Reply-To: <000c01d89743$0278e4f0$076aaed0$@nextmovesoftware.com>
List-Id: Gcc-patches mailing list

On Thu, Jul 14, 2022 at 7:32 AM Roger Sayle wrote:
>
>
> On Mon, Jul 11, 2022, H.J. Lu wrote:
> > On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle
> > wrote:
> > > Hi HJ,
> > >
> > > I believe this should now be handled by the post-reload (CSE) pass.
> > > Consider the simple test case:
> > >
> > > __int128 a, b, c;
> > > void foo()
> > > {
> > >   a = 0;
> > >   b = 0;
> > >   c = 0;
> > > }
> > >
> > > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC gets TI mode writes:
> > >         movq    $0, a(%rip)
> > >         movq    $0, a+8(%rip)
> > >         movq    $0, b(%rip)
> > >         movq    $0, b+8(%rip)
> > >         movq    $0, c(%rip)
> > >         movq    $0, c+8(%rip)
> > >         ret
> > >
> > > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, a(%rip)
> > >         movaps  %xmm0, b(%rip)
> > >         movaps  %xmm0, c(%rip)
> > >         ret
> > >
> > > You're quite right that internally STV actually generates the
> > > equivalent of:
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, a(%rip)
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, b(%rip)
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, c(%rip)
> > >         ret
> > >
> > > And currently, because STV runs before cse2 and combine, the const0_rtx
> > > gets CSE'd by the cse2 pass to produce the code we see.
> > > However, if you specify -fno-rerun-cse-after-loop (to disable the
> > > cse2 pass), you'll see we continue to generate the same optimized
> > > code, as the same const0_rtx gets CSE'd in postreload.
> > >
> > > I can't be certain until I try the experiment, but I believe that the
> > > postreload CSE will clean up all of the same common subexpressions.
> > > Hence, it should be safe to perform all STV at the same point (after
> > > combine), which allows for a few additional optimizations.
> > >
> > > Does this make sense? Do you have a test case where
> > > -fno-rerun-cse-after-loop produces different/inferior code for
> > > TImode STV chains?
> > >
> > > My guess is that the RTL passes have changed so much in the last six
> > > or seven years, that some of the original motivation no longer applies.
> > > Certainly we now try to keep TI mode operations visible longer, and
> > > then allow STV to behave like a pre-reload pass to decide which set of
> > > registers to use (vector V1TI or scalar doubleword DI). Any CSE
> > > opportunities that cse2 finds with V1TI mode, could/should equally
> > > well be found for TI mode (mostly).
> >
> > You are probably right. If there are no regressions in GCC testsuite, my original
> > motivation is no longer valid.
>
> It was good to try the experiment, but H.J. is right, there is still some benefit
> (as well as some disadvantages) to running STV lowering before CSE2/combine.
> A clean-up patch to perform all STV conversion as a single pass (removing a
> pass from the compiler) results in just a single regression in the test suite:
> FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8
> which looks like:
>
> __int128 a, b, c, d, e, f;
> void foo (void)
> {
>   a = 0;
>   b = -1;
>   c = 0;
>   d = -1;
>   e = 0;
>   f = -1;
> }
>
> By performing STV after combine (without CSE), reload prefers to implement
> this function using a single register, which then requires 12 instructions rather
> than 8 (if using two registers).
> Alas there's nothing that postreload CSE/GCSE can do. Doh!

Hmm, the RA could be taught to make use of more of the register file
I suppose (shouldn't regrename do this job? But it runs after
postreload-cse).

>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, a(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, b(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, c(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, d(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, e(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, f(%rip)
>         ret
>
> I also note that even without STV, the scalar implementation of this function when
> compiled with -Os is also larger than it needs to be due to poor CSE (notice in the
> following we only need a single zero register, and an all_ones reg would be helpful).
>
>         xorl    %eax, %eax
>         xorl    %edx, %edx
>         xorl    %ecx, %ecx
>         movq    $-1, b(%rip)
>         movq    %rax, a(%rip)
>         movq    %rax, a+8(%rip)
>         movq    $-1, b+8(%rip)
>         movq    %rdx, c(%rip)
>         movq    %rdx, c+8(%rip)
>         movq    $-1, d(%rip)
>         movq    $-1, d+8(%rip)
>         movq    %rcx, e(%rip)
>         movq    %rcx, e+8(%rip)
>         movq    $-1, f(%rip)
>         movq    $-1, f+8(%rip)
>         ret
>
> I need to give the problem some more thought. It would be good to clean-up/unify
> the STV passes, but I/we need to solve/CSE HJ's last test case before we do. Perhaps
> forbidding "(set (mem:ti) (const_int 0))" in movti_internal would force the zero
> register to become visible, and CSE'd, benefiting both vector code and scalar -Os code,
> then use postreload/peephole2 to fix up the remaining scalar cases.

It's tricky. Not sure if related, but ppc(?) folks recently tried to
massage CSE to avoid propagating constants by making sure that rtx_cost
handles (set (...) (const_int ...)) "properly". But IIRC CSE never does
the reverse transform (splitting out a constant to a pseudo from
multiple uses of the same constant). That's probably the job of
reload + postreload-CSE right now, but reload probably does not know
that there are multiple uses of the constant, so it cannot tell that
the splitting is worthwhile.

> Cheers,
> Roger
> --
>
>
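
For illustration only, the instruction counts quoted in this thread (4 vs 6
for the all-zero test case, 12 vs 8 for pr70155-17.c) can be reproduced with
a toy cost model. This is a sketch, not GCC code: count_insns, MAX_REGS and
the round-robin replacement policy are hypothetical names and assumptions
introduced here. It charges one instruction per store and one more whenever
the needed constant is not already held in a register, and it ignores the
final ret.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (an assumption for illustration, not how GCC's RA or CSE work):
   at most NREGS registers can each hold one materialized constant.
   NREGS == 0 models rematerializing the constant before every store.  */
enum { MAX_REGS = 8 };

int
count_insns (const long *consts, size_t n, int nregs)
{
  long held[MAX_REGS];
  int used = 0;   /* registers currently holding a constant */
  int next = 0;   /* round-robin victim once all registers are in use */
  int insns = 0;

  assert (nregs >= 0 && nregs <= MAX_REGS);

  for (size_t i = 0; i < n; i++)
    {
      int hit = 0;
      for (int r = 0; r < used; r++)
        if (held[r] == consts[i])
          hit = 1;

      if (!hit)
        {
          insns++;                      /* materialize: pxor or pcmpeqd */
          if (used < nregs)
            held[used++] = consts[i];
          else if (nregs > 0)
            {
              held[next] = consts[i];   /* overwrite an existing register */
              next = (next + 1) % nregs;
            }
        }
      insns++;                          /* the movaps store itself */
    }
  return insns;
}
```

Under this model the alternating 0/-1 sequence of six stores costs 12
instructions with one register and 8 with two, and the three zero stores cost
4 with one register versus 6 when nothing is shared, matching the listings
above.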