From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <hjl.tools@gmail.com>
Received: from mail-pl1-x62a.google.com (mail-pl1-x62a.google.com
 [IPv6:2607:f8b0:4864:20::62a])
 by sourceware.org (Postfix) with ESMTPS id 8939B3858C2F
 for <gcc-patches@gcc.gnu.org>; Sun, 10 Jul 2022 23:56:40 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 8939B3858C2F
Received: by mail-pl1-x62a.google.com with SMTP id j12so3090211plj.8
 for <gcc-patches@gcc.gnu.org>; Sun, 10 Jul 2022 16:56:40 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=KBIvkgqwl7N++pj9Rl/qMtwqtSzBehwk3Q61c2ehD/w=;
 b=Z6Si5NXwnDqJhdyY7+FiEnp2RHzpS5CWoXpLEQGes0lAEUhWO4ILaZwhOr/K+e0sp4
 78NDk+4vnHbwsMwLOYBjeHHMQ1ZYYOz7l6N37yRfriNIST56anuySXy4MgH/w7GcOEWl
 45zIo2Z/b35AVuBDt8dHr2eAGWcZ5pdYNDMc/w3ITGFD+rR1EWbxaNFoc0L0kYToPqNF
 fgOG4m9LQTrWipUGdjUU7FMlh040ok5wudBST6fGi6PZvgqqv6OgJgeaNQ4TVO8X0xCC
 1r2xAgRP4bAkzDWuf6jDNPoqo9yTWJiv4A6TAKz8y9lAKGURiYthGNTBdD+nKLXSffQO
 vp0g==
X-Gm-Message-State: AJIora+7cVkBaxmrOPxY2QEsCSqRPDmwiYndTXMpuLEsF9GFjfr3CVmD
 TWgNQLPW8BHvnHPMXtx6Ck3RvNQH+4glTbOIWVk=
X-Google-Smtp-Source: AGRyM1uFuuTY+ksPr9bbqgBobL6Kl9SlHo8I/f5Y6R19Nfq+iWOoNo/AZahaR+NWR3Gu9NlOCVVz+8Q2T9SeBGMpJFg=
X-Received: by 2002:a17:902:b215:b0:168:da4b:c925 with SMTP id
 t21-20020a170902b21500b00168da4bc925mr15244342plr.155.1657497399410; Sun, 10
 Jul 2022 16:56:39 -0700 (PDT)
MIME-Version: 1.0
References: <000901d8938d$ead4dc40$c07e94c0$@nextmovesoftware.com>
 <CAFULd4YY=ZG089XjQjrxmDR2Hy_ZD0eoXsjfVFz0xe=JioxhcQ@mail.gmail.com>
 <00f201d8948b$ec82a6e0$c587f4a0$@nextmovesoftware.com>
 <CAMe9rOrYp5pmU26NpcbV-f16tpJ6CVvtc5Szvp2TSW3egu0CyA@mail.gmail.com>
 <014a01d894a5$71189220$5349b660$@nextmovesoftware.com>
In-Reply-To: <014a01d894a5$71189220$5349b660$@nextmovesoftware.com>
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Sun, 10 Jul 2022 16:56:03 -0700
Message-ID: <CAMe9rOpds-iMyOLD9C5ZAzg2KQ=Mc8udAhYJRrY=4wAbYR=Kyw@mail.gmail.com>
Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode
 to V1TImode.
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: Uros Bizjak <ubizjak@gmail.com>, GCC Patches <gcc-patches@gcc.gnu.org>
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-3018.4 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, KAM_SHORT,
 RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Jul 2022 23:56:43 -0000

On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi HJ,
>
> I believe this should now be handled by the post-reload (CSE) pass.
> Consider the simple test case:
>
> __int128 a, b, c;
> void foo()
> {
>   a = 0;
>   b = 0;
>   c = 0;
> }
>
> Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes:
>         movq    $0, a(%rip)
>         movq    $0, a+8(%rip)
>         movq    $0, b(%rip)
>         movq    $0, b+8(%rip)
>         movq    $0, c(%rip)
>         movq    $0, c+8(%rip)
>         ret
>
> But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, a(%rip)
>         movaps  %xmm0, b(%rip)
>         movaps  %xmm0, c(%rip)
>         ret
>
> You're quite right internally the STV actually generates the equivalent of:
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, a(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, b(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, c(%rip)
>         ret
>
> And currently because STV run before cse2 and combine, the const0_rtx
> gets CSE'd be the cse2 pass to produce the code we see.  However, if you
> specify -fno-rerun-cse-after-loop (to disable the cse2 pass), you'll see we
> continue to generate the same optimized code, as the same const0_rtx
> gets CSE'd in postreload.
>
> I can't be certain until I try the experiment, but I believe that the postreload
> CSE will clean-up, all of the same common subexpressions.  Hence, it should
> be safe to perform all STV at the same point (after combine), which for a few
> additional optimizations.
>
> Does this make sense?  Do you have a test case, -fno-rerun-cse-after-loop
> produces different/inferior code for TImode STV chains?
>
> My guess is that the RTL passes have changed so much in the last six or
> seven years, that some of the original motivation no longer applies.
> Certainly we now try to keep TI mode operations visible longer, and
> then allow STV to behave like a pre-reload pass to decide which set of
> registers to use (vector V1TI or scalar doubleword DI).  Any CSE opportunities
> that cse2 finds with V1TI mode, could/should equally well be found for
> TI mode (mostly).

You are probably right.  If there are no regressions in GCC testsuite,
my original motivation is no longer valid.

Thanks.

> Cheers,
> Roger
> --
>
> > -----Original Message-----
> > From: H.J. Lu <hjl.tools@gmail.com>
> > Sent: 10 July 2022 20:15
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: Uros Bizjak <ubizjak@gmail.com>; GCC Patches <gcc-patches@gcc.gnu.org>
> > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for
> > TImode to V1TImode.
> >
> > On Sun, Jul 10, 2022 at 11:36 AM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > Hi Uros,
> > > Yes, I agree.  I think it makes sense to have a single STV pass (after
> > > combine and before reload).  Let's hear what HJ thinks, but I'm happy
> > > to investigate a follow-up patch that unifies the STV passes.
> > > But it'll be easier to confirm there are no "code generation" changes
> > > if those modifications are pushed independently of these ones.
> > > Time to look into the (git) history of multiple STV passes...
> > >
> > > Thanks for the review.  I'll wait for HJ's thoughts.
> >
> > The TImode STV pass is run before the CSE pass so that instructions changed or
> > generated by the STV pass can be CSEed.
> >
> > > Cheers,
> > > Roger
> > > --
> > >
> > > > -----Original Message-----
> > > > From: Uros Bizjak <ubizjak@gmail.com>
> > > > Sent: 10 July 2022 19:06
> > > > To: Roger Sayle <roger@nextmovesoftware.com>
> > > > Cc: gcc-patches@gcc.gnu.org; H. J. Lu <hjl.tools@gmail.com>
> > > > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support
> > > > for TImode to V1TImode.
> > > >
> > > > On Sat, Jul 9, 2022 at 2:17 PM Roger Sayle
> > > > <roger@nextmovesoftware.com>
> > > > wrote:
> > > > >
> > > > >
> > > > > This patch upgrades x86_64's scalar-to-vector (STV) pass to more
> > > > > aggressively transform 128-bit scalar TImode operations into
> > > > > vector V1TImode operations performed on SSE registers.  TImode
> > > > > functionality already exists in STV, but only for move operations,
> > > > > this changes brings support for logical operations (AND, IOR, XOR,
> > > > > NOT and ANDN) and comparisons.
> > > > >
> > > > > The effect of these changes are conveniently demonstrated by the
> > > > > new sse4_1-stv-5.c test case:
> > > > >
> > > > > __int128 a[16];
> > > > > __int128 b[16];
> > > > > __int128 c[16];
> > > > >
> > > > > void foo()
> > > > > {
> > > > >   for (unsigned int i=0; i<16; i++)
> > > > >     a[i] = b[i] & ~c[i];
> > > > > }
> > > > >
> > > > > which when currently compiled on mainline wtih -O2 -msse4 produces:
> > > > >
> > > > > foo:    xorl    %eax, %eax
> > > > > .L2:    movq    c(%rax), %rsi
> > > > >         movq    c+8(%rax), %rdi
> > > > >         addq    $16, %rax
> > > > >         notq    %rsi
> > > > >         notq    %rdi
> > > > >         andq    b-16(%rax), %rsi
> > > > >         andq    b-8(%rax), %rdi
> > > > >         movq    %rsi, a-16(%rax)
> > > > >         movq    %rdi, a-8(%rax)
> > > > >         cmpq    $256, %rax
> > > > >         jne     .L2
> > > > >         ret
> > > > >
> > > > > but with this patch now produces:
> > > > >
> > > > > foo:    xorl    %eax, %eax
> > > > > .L2:    movdqa  c(%rax), %xmm0
> > > > >         pandn   b(%rax), %xmm0
> > > > >         addq    $16, %rax
> > > > >         movaps  %xmm0, a-16(%rax)
> > > > >         cmpq    $256, %rax
> > > > >         jne     .L2
> > > > >         ret
> > > > >
> > > > > Technically, the STV pass is implemented by three C++ classes, a
> > > > > common abstract base class "scalar_chain" that contains common
> > > > > functionality, and two derived classes: general_scalar_chain
> > > > > (which handles SI and DI modes) and timode_scalar_chain (which
> > > > > handles TI modes).  As mentioned previously, because only TI mode
> > > > > moves were handled the two worker classes behaved significantly
> > differently.
> > > > > These changes bring the functionality of these two classes closer
> > > > > together, which is reflected by refactoring more shared code from
> > > > > general_scalar_chain to the parent scalar_chain and reusing it
> > > > > from timode.  There still remain significant differences (and
> > > > > simplifications) so the existing division of classes (as
> > > > > specializations) continues
> > > > to make sense.
> > > >
> > > > Please note that there are in fact two STV passes, one before
> > > > combine and the other after combine. The TImode pass that previously
> > > > handled only loads and stores is positioned before combine (there
> > > > was a reason for this decision, but I don't remember the details -
> > > > let's ask HJ...). However, DImode STV pass transforms much more
> > > > instructions and the reason it was positioned after the combine pass
> > > > was that STV pass transforms optimized insn stream where forward
> > propagation was already performed.
> > > >
> > > > What is not clear to me from the above explanation is: is the new
> > > > TImode STV pass positioned after the combine pass, and if this is
> > > > the case, how the change affects current load/store TImode STV pass.
> > > > I must admit, I don't like two separate STV passess, so if TImode is
> > > > now similar to DImode, I suggest we abandon STV1 pass and do
> > > > everything concerning TImode after the combine pass. HJ, what is your
> > opinion on this?
> > > >
> > > > Other than the above, the patch LGTM to me.
> > > >
> > > > Uros.
> > > >
> > > > > Obviously, there are more changes to come (shifts and rotates),
> > > > > and compute_convert_gain doesn't yet have its final (tuned) form,
> > > > > but is already an improvement over the "return 1;" used previously.
> > > > >
> > > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > > boostrap and make -k check, both with and without
> > > > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > > > >
> > > > >
> > > > > 2022-07-09  Roger Sayle  <roger@nextmovesoftware.com>
> > > > >
> > > > > gcc/ChangeLog
> > > > >         * config/i386/i386-features.h (scalar_chain): Add fields
> > > > >         insns_conv, n_sse_to_integer and n_integer_to_sse to this
> > > > >         parent class, moved from general_scalar_chain.
> > > > >         (scalar_chain::convert_compare): Protected method moved
> > > > >         from general_scalar_chain.
> > > > >         (mark_dual_mode_def): Make protected, not private virtual.
> > > > >         (scalar_chain:convert_op): New private virtual method.
> > > > >
> > > > >         (general_scalar_chain::general_scalar_chain): Simplify constructor.
> > > > >         (general_scalar_chain::~general_scalar_chain): Delete destructor.
> > > > >         (general_scalar_chain): Move insns_conv, n_sse_to_integer and
> > > > >         n_integer_to_sse fields to parent class, scalar_chain.
> > > > >         (general_scalar_chain::mark_dual_mode_def): Delete prototype.
> > > > >         (general_scalar_chain::convert_compare): Delete prototype.
> > > > >
> > > > >         (timode_scalar_chain::compute_convert_gain): Remove simplistic
> > > > >         implementation, convert to a method prototype.
> > > > >         (timode_scalar_chain::mark_dual_mode_def): Delete prototype.
> > > > >         (timode_scalar_chain::convert_op): Prototype new virtual method.
> > > > >
> > > > >         * config/i386/i386-features.cc (scalar_chain::scalar_chain):
> > > > >         Allocate insns_conv and initialize n_sse_to_integer and
> > > > >         n_integer_to_sse fields in constructor.
> > > > >         (scalar_chain::scalar_chain): Free insns_conv in destructor.
> > > > >
> > > > >         (general_scalar_chain::general_scalar_chain): Delete
> > > > >         constructor, now defined in the class declaration.
> > > > >         (general_scalar_chain::~general_scalar_chain): Delete destructor.
> > > > >
> > > > >         (scalar_chain::mark_dual_mode_def): Renamed from
> > > > >         general_scalar_chain::mark_dual_mode_def.
> > > > >         (timode_scalar_chain::mark_dual_mode_def): Delete.
> > > > >         (scalar_chain::convert_compare): Renamed from
> > > > >         general_scalar_chain::convert_compare.
> > > > >
> > > > >         (timode_scalar_chain::compute_convert_gain): New method to
> > > > >         determine the gain from converting a TImode chain to V1TImode.
> > > > >         (timode_scalar_chain::convert_op): New method to convert an
> > > > >         operand from TImode to V1TImode.
> > > > >
> > > > >         (timode_scalar_chain::convert_insn) <case REG>: Only PUT_MODE
> > > > >         on REG_EQUAL notes that were originally TImode (not CONST_INT).
> > > > >         Handle AND, ANDN, XOR, IOR, NOT and COMPARE.
> > > > >         (timode_mem_p): Helper predicate to check where operand is
> > > > >         memory reference with sufficient alignment for TImode STV.
> > > > >         (timode_scalar_to_vector_candidate_p): Use
> > convertible_comparison_p
> > > > >         to check whether COMPARE is convertible.  Handle SET_DESTs that
> > > > >         that are REG_P or MEM_P and SET_SRCs that are REG, CONST_INT,
> > > > >         CONST_WIDE_INT, MEM, AND, ANDN, IOR, XOR or NOT.
> > > > >
> > > > > gcc/testsuite/ChangeLog
> > > > >         * gcc.target/i386/sse4_1-stv-2.c: New test case, pand.
> > > > >         * gcc.target/i386/sse4_1-stv-3.c: New test case, por.
> > > > >         * gcc.target/i386/sse4_1-stv-4.c: New test case, pxor.
> > > > >         * gcc.target/i386/sse4_1-stv-5.c: New test case, pandn.
> > > > >         * gcc.target/i386/sse4_1-stv-6.c: New test case, ptest.
> > > > >
> > > > > Roger
> > > > > --
> > > > >
> > >
> >
> >
> > --
> > H.J.
>


-- 
H.J.