From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pl1-x62a.google.com (mail-pl1-x62a.google.com [IPv6:2607:f8b0:4864:20::62a]) by sourceware.org (Postfix) with ESMTPS id 8939B3858C2F for ; Sun, 10 Jul 2022 23:56:40 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 8939B3858C2F Received: by mail-pl1-x62a.google.com with SMTP id j12so3090211plj.8 for ; Sun, 10 Jul 2022 16:56:40 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=KBIvkgqwl7N++pj9Rl/qMtwqtSzBehwk3Q61c2ehD/w=; b=Z6Si5NXwnDqJhdyY7+FiEnp2RHzpS5CWoXpLEQGes0lAEUhWO4ILaZwhOr/K+e0sp4 78NDk+4vnHbwsMwLOYBjeHHMQ1ZYYOz7l6N37yRfriNIST56anuySXy4MgH/w7GcOEWl 45zIo2Z/b35AVuBDt8dHr2eAGWcZ5pdYNDMc/w3ITGFD+rR1EWbxaNFoc0L0kYToPqNF fgOG4m9LQTrWipUGdjUU7FMlh040ok5wudBST6fGi6PZvgqqv6OgJgeaNQ4TVO8X0xCC 1r2xAgRP4bAkzDWuf6jDNPoqo9yTWJiv4A6TAKz8y9lAKGURiYthGNTBdD+nKLXSffQO vp0g== X-Gm-Message-State: AJIora+7cVkBaxmrOPxY2QEsCSqRPDmwiYndTXMpuLEsF9GFjfr3CVmD TWgNQLPW8BHvnHPMXtx6Ck3RvNQH+4glTbOIWVk= X-Google-Smtp-Source: AGRyM1uFuuTY+ksPr9bbqgBobL6Kl9SlHo8I/f5Y6R19Nfq+iWOoNo/AZahaR+NWR3Gu9NlOCVVz+8Q2T9SeBGMpJFg= X-Received: by 2002:a17:902:b215:b0:168:da4b:c925 with SMTP id t21-20020a170902b21500b00168da4bc925mr15244342plr.155.1657497399410; Sun, 10 Jul 2022 16:56:39 -0700 (PDT) MIME-Version: 1.0 References: <000901d8938d$ead4dc40$c07e94c0$@nextmovesoftware.com> <00f201d8948b$ec82a6e0$c587f4a0$@nextmovesoftware.com> <014a01d894a5$71189220$5349b660$@nextmovesoftware.com> In-Reply-To: <014a01d894a5$71189220$5349b660$@nextmovesoftware.com> From: "H.J. Lu" Date: Sun, 10 Jul 2022 16:56:03 -0700 Message-ID: Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode. To: Roger Sayle Cc: Uros Bizjak , GCC Patches Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-3018.4 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Jul 2022 23:56:43 -0000 On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle wrote: > > > Hi HJ, > > I believe this should now be handled by the post-reload (CSE) pass. > Consider the simple test case: > > __int128 a, b, c; > void foo() > { > a = 0; > b = 0; > c = 0; > } > > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes: > movq $0, a(%rip) > movq $0, a+8(%rip) > movq $0, b(%rip) > movq $0, b+8(%rip) > movq $0, c(%rip) > movq $0, c+8(%rip) > ret > > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode: > pxor %xmm0, %xmm0 > movaps %xmm0, a(%rip) > movaps %xmm0, b(%rip) > movaps %xmm0, c(%rip) > ret > > You're quite right internally the STV actually generates the equivalent of: > pxor %xmm0, %xmm0 > movaps %xmm0, a(%rip) > pxor %xmm0, %xmm0 > movaps %xmm0, b(%rip) > pxor %xmm0, %xmm0 > movaps %xmm0, c(%rip) > ret > > And currently because STV run before cse2 and combine, the const0_rtx > gets CSE'd be the cse2 pass to produce the code we see. However, if you > specify -fno-rerun-cse-after-loop (to disable the cse2 pass), you'll see we > continue to generate the same optimized code, as the same const0_rtx > gets CSE'd in postreload. > > I can't be certain until I try the experiment, but I believe that the postreload > CSE will clean-up, all of the same common subexpressions. Hence, it should > be safe to perform all STV at the same point (after combine), which for a few > additional optimizations. > > Does this make sense? Do you have a test case, -fno-rerun-cse-after-loop > produces different/inferior code for TImode STV chains? > > My guess is that the RTL passes have changed so much in the last six or > seven years, that some of the original motivation no longer applies. > Certainly we now try to keep TI mode operations visible longer, and > then allow STV to behave like a pre-reload pass to decide which set of > registers to use (vector V1TI or scalar doubleword DI). Any CSE opportunities > that cse2 finds with V1TI mode, could/should equally well be found for > TI mode (mostly). You are probably right. If there are no regressions in GCC testsuite, my original motivation is no longer valid. Thanks. > Cheers, > Roger > -- > > > -----Original Message----- > > From: H.J. Lu > > Sent: 10 July 2022 20:15 > > To: Roger Sayle > > Cc: Uros Bizjak ; GCC Patches > > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for > > TImode to V1TImode. > > > > On Sun, Jul 10, 2022 at 11:36 AM Roger Sayle > > wrote: > > > > > > > > > Hi Uros, > > > Yes, I agree. I think it makes sense to have a single STV pass (after > > > combine and before reload). Let's hear what HJ thinks, but I'm happy > > > to investigate a follow-up patch that unifies the STV passes. > > > But it'll be easier to confirm there are no "code generation" changes > > > if those modifications are pushed independently of these ones. > > > Time to look into the (git) history of multiple STV passes... > > > > > > Thanks for the review. I'll wait for HJ's thoughts. > > > > The TImode STV pass is run before the CSE pass so that instructions changed or > > generated by the STV pass can be CSEed. > > > > > Cheers, > > > Roger > > > -- > > > > > > > -----Original Message----- > > > > From: Uros Bizjak > > > > Sent: 10 July 2022 19:06 > > > > To: Roger Sayle > > > > Cc: gcc-patches@gcc.gnu.org; H. J. Lu > > > > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support > > > > for TImode to V1TImode. > > > > > > > > On Sat, Jul 9, 2022 at 2:17 PM Roger Sayle > > > > > > > > wrote: > > > > > > > > > > > > > > > This patch upgrades x86_64's scalar-to-vector (STV) pass to more > > > > > aggressively transform 128-bit scalar TImode operations into > > > > > vector V1TImode operations performed on SSE registers. TImode > > > > > functionality already exists in STV, but only for move operations, > > > > > this changes brings support for logical operations (AND, IOR, XOR, > > > > > NOT and ANDN) and comparisons. > > > > > > > > > > The effect of these changes are conveniently demonstrated by the > > > > > new sse4_1-stv-5.c test case: > > > > > > > > > > __int128 a[16]; > > > > > __int128 b[16]; > > > > > __int128 c[16]; > > > > > > > > > > void foo() > > > > > { > > > > > for (unsigned int i=0; i<16; i++) > > > > > a[i] = b[i] & ~c[i]; > > > > > } > > > > > > > > > > which when currently compiled on mainline wtih -O2 -msse4 produces: > > > > > > > > > > foo: xorl %eax, %eax > > > > > .L2: movq c(%rax), %rsi > > > > > movq c+8(%rax), %rdi > > > > > addq $16, %rax > > > > > notq %rsi > > > > > notq %rdi > > > > > andq b-16(%rax), %rsi > > > > > andq b-8(%rax), %rdi > > > > > movq %rsi, a-16(%rax) > > > > > movq %rdi, a-8(%rax) > > > > > cmpq $256, %rax > > > > > jne .L2 > > > > > ret > > > > > > > > > > but with this patch now produces: > > > > > > > > > > foo: xorl %eax, %eax > > > > > .L2: movdqa c(%rax), %xmm0 > > > > > pandn b(%rax), %xmm0 > > > > > addq $16, %rax > > > > > movaps %xmm0, a-16(%rax) > > > > > cmpq $256, %rax > > > > > jne .L2 > > > > > ret > > > > > > > > > > Technically, the STV pass is implemented by three C++ classes, a > > > > > common abstract base class "scalar_chain" that contains common > > > > > functionality, and two derived classes: general_scalar_chain > > > > > (which handles SI and DI modes) and timode_scalar_chain (which > > > > > handles TI modes). As mentioned previously, because only TI mode > > > > > moves were handled the two worker classes behaved significantly > > differently. > > > > > These changes bring the functionality of these two classes closer > > > > > together, which is reflected by refactoring more shared code from > > > > > general_scalar_chain to the parent scalar_chain and reusing it > > > > > from timode. There still remain significant differences (and > > > > > simplifications) so the existing division of classes (as > > > > > specializations) continues > > > > to make sense. > > > > > > > > Please note that there are in fact two STV passes, one before > > > > combine and the other after combine. The TImode pass that previously > > > > handled only loads and stores is positioned before combine (there > > > > was a reason for this decision, but I don't remember the details - > > > > let's ask HJ...). However, DImode STV pass transforms much more > > > > instructions and the reason it was positioned after the combine pass > > > > was that STV pass transforms optimized insn stream where forward > > propagation was already performed. > > > > > > > > What is not clear to me from the above explanation is: is the new > > > > TImode STV pass positioned after the combine pass, and if this is > > > > the case, how the change affects current load/store TImode STV pass. > > > > I must admit, I don't like two separate STV passess, so if TImode is > > > > now similar to DImode, I suggest we abandon STV1 pass and do > > > > everything concerning TImode after the combine pass. HJ, what is your > > opinion on this? > > > > > > > > Other than the above, the patch LGTM to me. > > > > > > > > Uros. > > > > > > > > > Obviously, there are more changes to come (shifts and rotates), > > > > > and compute_convert_gain doesn't yet have its final (tuned) form, > > > > > but is already an improvement over the "return 1;" used previously. > > > > > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make > > > > > boostrap and make -k check, both with and without > > > > > --target_board=unix{-m32} with no new failures. Ok for mainline? > > > > > > > > > > > > > > > 2022-07-09 Roger Sayle > > > > > > > > > > gcc/ChangeLog > > > > > * config/i386/i386-features.h (scalar_chain): Add fields > > > > > insns_conv, n_sse_to_integer and n_integer_to_sse to this > > > > > parent class, moved from general_scalar_chain. > > > > > (scalar_chain::convert_compare): Protected method moved > > > > > from general_scalar_chain. > > > > > (mark_dual_mode_def): Make protected, not private virtual. > > > > > (scalar_chain:convert_op): New private virtual method. > > > > > > > > > > (general_scalar_chain::general_scalar_chain): Simplify constructor. > > > > > (general_scalar_chain::~general_scalar_chain): Delete destructor. > > > > > (general_scalar_chain): Move insns_conv, n_sse_to_integer and > > > > > n_integer_to_sse fields to parent class, scalar_chain. > > > > > (general_scalar_chain::mark_dual_mode_def): Delete prototype. > > > > > (general_scalar_chain::convert_compare): Delete prototype. > > > > > > > > > > (timode_scalar_chain::compute_convert_gain): Remove simplistic > > > > > implementation, convert to a method prototype. > > > > > (timode_scalar_chain::mark_dual_mode_def): Delete prototype. > > > > > (timode_scalar_chain::convert_op): Prototype new virtual method. > > > > > > > > > > * config/i386/i386-features.cc (scalar_chain::scalar_chain): > > > > > Allocate insns_conv and initialize n_sse_to_integer and > > > > > n_integer_to_sse fields in constructor. > > > > > (scalar_chain::scalar_chain): Free insns_conv in destructor. > > > > > > > > > > (general_scalar_chain::general_scalar_chain): Delete > > > > > constructor, now defined in the class declaration. > > > > > (general_scalar_chain::~general_scalar_chain): Delete destructor. > > > > > > > > > > (scalar_chain::mark_dual_mode_def): Renamed from > > > > > general_scalar_chain::mark_dual_mode_def. > > > > > (timode_scalar_chain::mark_dual_mode_def): Delete. > > > > > (scalar_chain::convert_compare): Renamed from > > > > > general_scalar_chain::convert_compare. > > > > > > > > > > (timode_scalar_chain::compute_convert_gain): New method to > > > > > determine the gain from converting a TImode chain to V1TImode. > > > > > (timode_scalar_chain::convert_op): New method to convert an > > > > > operand from TImode to V1TImode. > > > > > > > > > > (timode_scalar_chain::convert_insn) : Only PUT_MODE > > > > > on REG_EQUAL notes that were originally TImode (not CONST_INT). > > > > > Handle AND, ANDN, XOR, IOR, NOT and COMPARE. > > > > > (timode_mem_p): Helper predicate to check where operand is > > > > > memory reference with sufficient alignment for TImode STV. > > > > > (timode_scalar_to_vector_candidate_p): Use > > convertible_comparison_p > > > > > to check whether COMPARE is convertible. Handle SET_DESTs that > > > > > that are REG_P or MEM_P and SET_SRCs that are REG, CONST_INT, > > > > > CONST_WIDE_INT, MEM, AND, ANDN, IOR, XOR or NOT. > > > > > > > > > > gcc/testsuite/ChangeLog > > > > > * gcc.target/i386/sse4_1-stv-2.c: New test case, pand. > > > > > * gcc.target/i386/sse4_1-stv-3.c: New test case, por. > > > > > * gcc.target/i386/sse4_1-stv-4.c: New test case, pxor. > > > > > * gcc.target/i386/sse4_1-stv-5.c: New test case, pandn. > > > > > * gcc.target/i386/sse4_1-stv-6.c: New test case, ptest. > > > > > > > > > > Roger > > > > > -- > > > > > > > > > > > > > > -- > > H.J. > -- H.J.