From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <roger@nextmovesoftware.com>
Received: from server.nextmovesoftware.com (server.nextmovesoftware.com
 [162.254.253.69])
 by sourceware.org (Postfix) with ESMTPS id 2F7293857368
 for <gcc-patches@gcc.gnu.org>; Sun, 10 Jul 2022 21:38:51 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 2F7293857368
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none)
 header.from=nextmovesoftware.com
Authentication-Results: sourceware.org;
 spf=pass smtp.mailfrom=nextmovesoftware.com
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
 d=nextmovesoftware.com; s=default; h=Content-Transfer-Encoding:Content-Type:
 MIME-Version:Message-ID:Date:Subject:In-Reply-To:References:Cc:To:From:Sender
 :Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From:
 Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
 List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
 bh=3KJhKK8z0x9tWBZRdrT9Fp+AZUxQFU4lkqOqWnf+Ryo=; b=SiOmKLQj1GZhkxLsrd4bwejUGF
 rIUKi+okBkL15jnPqI1CAMceCMx3JeQ2gGOlMWifdkE+Nszljh09onnRKpMXjhTa3Awng+Em1KPL+
 epMvcb7RM5XzAe6HuuaNeKCmX7p8MwrxXz+JNwyyezbqsmZMDbhW6LxvYhyyvfD7Vn/AxGAyF4Rp9
 kQIHeGJ0fCkS9FthzvBmJzv3H1TTKTnLlKSnetd6Y958AmWLDH5UxGfMWtzx2FWj7JHi93n4ajK1u
 1vn2oOQ1NaWWhghg/3HjcA7Cb3oiVdLmHyfpOeuNDK/wr0KsI0Kt5j8A35Y6Sg6lOgOJGdchviU6F
 pjTn8vhw==;
Received: from host86-130-134-60.range86-130.btcentralplus.com
 ([86.130.134.60]:53904 helo=Dell)
 by server.nextmovesoftware.com with esmtpsa (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2)
 (envelope-from <roger@nextmovesoftware.com>)
 id 1oAede-0005XG-GU; Sun, 10 Jul 2022 17:38:50 -0400
From: "Roger Sayle" <roger@nextmovesoftware.com>
To: "'H.J. Lu'" <hjl.tools@gmail.com>
Cc: "'Uros Bizjak'" <ubizjak@gmail.com>,
 "'GCC Patches'" <gcc-patches@gcc.gnu.org>
References: <000901d8938d$ead4dc40$c07e94c0$@nextmovesoftware.com>
 <CAFULd4YY=ZG089XjQjrxmDR2Hy_ZD0eoXsjfVFz0xe=JioxhcQ@mail.gmail.com>
 <00f201d8948b$ec82a6e0$c587f4a0$@nextmovesoftware.com>
 <CAMe9rOrYp5pmU26NpcbV-f16tpJ6CVvtc5Szvp2TSW3egu0CyA@mail.gmail.com>
In-Reply-To: <CAMe9rOrYp5pmU26NpcbV-f16tpJ6CVvtc5Szvp2TSW3egu0CyA@mail.gmail.com>
Subject: RE: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode
 to V1TImode.
Date: Sun, 10 Jul 2022 22:38:47 +0100
Message-ID: <014a01d894a5$71189220$5349b660$@nextmovesoftware.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-Mailer: Microsoft Outlook 16.0
Thread-Index: AQDLEU72DOSSolIGHYPmM3CDaDY5HQFm5hbmAaCooigCBX1m0q9qtjfQ
Content-Language: en-gb
X-AntiAbuse: This header was added to track abuse,
 please include it with any abuse report
X-AntiAbuse: Primary Hostname - server.nextmovesoftware.com
X-AntiAbuse: Original Domain - gcc.gnu.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - nextmovesoftware.com
X-Get-Message-Sender-Via: server.nextmovesoftware.com: authenticated_id:
 roger@nextmovesoftware.com
X-Authenticated-Sender: server.nextmovesoftware.com: roger@nextmovesoftware.com
X-Source: 
X-Source-Args: 
X-Source-Dir: 
X-Spam-Status: No, score=-6.6 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, KAM_SHORT, SPF_HELO_NONE, SPF_PASS,
 TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Sun, 10 Jul 2022 21:38:53 -0000


Hi HJ,

I believe this should now be handled by the post-reload (CSE) pass.
Consider the simple test case:

__int128 a, b, c;
void foo()
{
  a =3D 0;
  b =3D 0;
  c =3D 0;
}

Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes:
        movq    $0, a(%rip)
        movq    $0, a+8(%rip)
        movq    $0, b(%rip)
        movq    $0, b+8(%rip)
        movq    $0, c(%rip)
        movq    $0, c+8(%rip)
        ret

But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
        pxor    %xmm0, %xmm0
        movaps  %xmm0, a(%rip)
        movaps  %xmm0, b(%rip)
        movaps  %xmm0, c(%rip)
        ret

You're quite right internally the STV actually generates the equivalent =
of:
        pxor    %xmm0, %xmm0
        movaps  %xmm0, a(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, b(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, c(%rip)
        ret

And currently because STV run before cse2 and combine, the const0_rtx
gets CSE'd be the cse2 pass to produce the code we see.  However, if you
specify -fno-rerun-cse-after-loop (to disable the cse2 pass), you'll see =
we
continue to generate the same optimized code, as the same const0_rtx
gets CSE'd in postreload.

I can't be certain until I try the experiment, but I believe that the =
postreload
CSE will clean-up, all of the same common subexpressions.  Hence, it =
should
be safe to perform all STV at the same point (after combine), which for =
a few
additional optimizations.

Does this make sense?  Do you have a test case, =
-fno-rerun-cse-after-loop
produces different/inferior code for TImode STV chains?

My guess is that the RTL passes have changed so much in the last six or
seven years, that some of the original motivation no longer applies.
Certainly we now try to keep TI mode operations visible longer, and
then allow STV to behave like a pre-reload pass to decide which set of
registers to use (vector V1TI or scalar doubleword DI).  Any CSE =
opportunities
that cse2 finds with V1TI mode, could/should equally well be found for
TI mode (mostly).

Cheers,
Roger
--

> -----Original Message-----
> From: H.J. Lu <hjl.tools@gmail.com>
> Sent: 10 July 2022 20:15
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: Uros Bizjak <ubizjak@gmail.com>; GCC Patches =
<gcc-patches@gcc.gnu.org>
> Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support =
for
> TImode to V1TImode.
>=20
> On Sun, Jul 10, 2022 at 11:36 AM Roger Sayle =
<roger@nextmovesoftware.com>
> wrote:
> >
> >
> > Hi Uros,
> > Yes, I agree.  I think it makes sense to have a single STV pass =
(after
> > combine and before reload).  Let's hear what HJ thinks, but I'm =
happy
> > to investigate a follow-up patch that unifies the STV passes.
> > But it'll be easier to confirm there are no "code generation" =
changes
> > if those modifications are pushed independently of these ones.
> > Time to look into the (git) history of multiple STV passes...
> >
> > Thanks for the review.  I'll wait for HJ's thoughts.
>=20
> The TImode STV pass is run before the CSE pass so that instructions =
changed or
> generated by the STV pass can be CSEed.
>=20
> > Cheers,
> > Roger
> > --
> >
> > > -----Original Message-----
> > > From: Uros Bizjak <ubizjak@gmail.com>
> > > Sent: 10 July 2022 19:06
> > > To: Roger Sayle <roger@nextmovesoftware.com>
> > > Cc: gcc-patches@gcc.gnu.org; H. J. Lu <hjl.tools@gmail.com>
> > > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) =
support
> > > for TImode to V1TImode.
> > >
> > > On Sat, Jul 9, 2022 at 2:17 PM Roger Sayle
> > > <roger@nextmovesoftware.com>
> > > wrote:
> > > >
> > > >
> > > > This patch upgrades x86_64's scalar-to-vector (STV) pass to more
> > > > aggressively transform 128-bit scalar TImode operations into
> > > > vector V1TImode operations performed on SSE registers.  TImode
> > > > functionality already exists in STV, but only for move =
operations,
> > > > this changes brings support for logical operations (AND, IOR, =
XOR,
> > > > NOT and ANDN) and comparisons.
> > > >
> > > > The effect of these changes are conveniently demonstrated by the
> > > > new sse4_1-stv-5.c test case:
> > > >
> > > > __int128 a[16];
> > > > __int128 b[16];
> > > > __int128 c[16];
> > > >
> > > > void foo()
> > > > {
> > > >   for (unsigned int i=3D0; i<16; i++)
> > > >     a[i] =3D b[i] & ~c[i];
> > > > }
> > > >
> > > > which when currently compiled on mainline wtih -O2 -msse4 =
produces:
> > > >
> > > > foo:    xorl    %eax, %eax
> > > > .L2:    movq    c(%rax), %rsi
> > > >         movq    c+8(%rax), %rdi
> > > >         addq    $16, %rax
> > > >         notq    %rsi
> > > >         notq    %rdi
> > > >         andq    b-16(%rax), %rsi
> > > >         andq    b-8(%rax), %rdi
> > > >         movq    %rsi, a-16(%rax)
> > > >         movq    %rdi, a-8(%rax)
> > > >         cmpq    $256, %rax
> > > >         jne     .L2
> > > >         ret
> > > >
> > > > but with this patch now produces:
> > > >
> > > > foo:    xorl    %eax, %eax
> > > > .L2:    movdqa  c(%rax), %xmm0
> > > >         pandn   b(%rax), %xmm0
> > > >         addq    $16, %rax
> > > >         movaps  %xmm0, a-16(%rax)
> > > >         cmpq    $256, %rax
> > > >         jne     .L2
> > > >         ret
> > > >
> > > > Technically, the STV pass is implemented by three C++ classes, a
> > > > common abstract base class "scalar_chain" that contains common
> > > > functionality, and two derived classes: general_scalar_chain
> > > > (which handles SI and DI modes) and timode_scalar_chain (which
> > > > handles TI modes).  As mentioned previously, because only TI =
mode
> > > > moves were handled the two worker classes behaved significantly
> differently.
> > > > These changes bring the functionality of these two classes =
closer
> > > > together, which is reflected by refactoring more shared code =
from
> > > > general_scalar_chain to the parent scalar_chain and reusing it
> > > > from timode.  There still remain significant differences (and
> > > > simplifications) so the existing division of classes (as
> > > > specializations) continues
> > > to make sense.
> > >
> > > Please note that there are in fact two STV passes, one before
> > > combine and the other after combine. The TImode pass that =
previously
> > > handled only loads and stores is positioned before combine (there
> > > was a reason for this decision, but I don't remember the details -
> > > let's ask HJ...). However, DImode STV pass transforms much more
> > > instructions and the reason it was positioned after the combine =
pass
> > > was that STV pass transforms optimized insn stream where forward
> propagation was already performed.
> > >
> > > What is not clear to me from the above explanation is: is the new
> > > TImode STV pass positioned after the combine pass, and if this is
> > > the case, how the change affects current load/store TImode STV =
pass.
> > > I must admit, I don't like two separate STV passess, so if TImode =
is
> > > now similar to DImode, I suggest we abandon STV1 pass and do
> > > everything concerning TImode after the combine pass. HJ, what is =
your
> opinion on this?
> > >
> > > Other than the above, the patch LGTM to me.
> > >
> > > Uros.
> > >
> > > > Obviously, there are more changes to come (shifts and rotates),
> > > > and compute_convert_gain doesn't yet have its final (tuned) =
form,
> > > > but is already an improvement over the "return 1;" used =
previously.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > boostrap and make -k check, both with and without
> > > > --target_board=3Dunix{-m32} with no new failures.  Ok for =
mainline?
> > > >
> > > >
> > > > 2022-07-09  Roger Sayle  <roger@nextmovesoftware.com>
> > > >
> > > > gcc/ChangeLog
> > > >         * config/i386/i386-features.h (scalar_chain): Add fields
> > > >         insns_conv, n_sse_to_integer and n_integer_to_sse to =
this
> > > >         parent class, moved from general_scalar_chain.
> > > >         (scalar_chain::convert_compare): Protected method moved
> > > >         from general_scalar_chain.
> > > >         (mark_dual_mode_def): Make protected, not private =
virtual.
> > > >         (scalar_chain:convert_op): New private virtual method.
> > > >
> > > >         (general_scalar_chain::general_scalar_chain): Simplify =
constructor.
> > > >         (general_scalar_chain::~general_scalar_chain): Delete =
destructor.
> > > >         (general_scalar_chain): Move insns_conv, =
n_sse_to_integer and
> > > >         n_integer_to_sse fields to parent class, scalar_chain.
> > > >         (general_scalar_chain::mark_dual_mode_def): Delete =
prototype.
> > > >         (general_scalar_chain::convert_compare): Delete =
prototype.
> > > >
> > > >         (timode_scalar_chain::compute_convert_gain): Remove =
simplistic
> > > >         implementation, convert to a method prototype.
> > > >         (timode_scalar_chain::mark_dual_mode_def): Delete =
prototype.
> > > >         (timode_scalar_chain::convert_op): Prototype new virtual =
method.
> > > >
> > > >         * config/i386/i386-features.cc =
(scalar_chain::scalar_chain):
> > > >         Allocate insns_conv and initialize n_sse_to_integer and
> > > >         n_integer_to_sse fields in constructor.
> > > >         (scalar_chain::scalar_chain): Free insns_conv in =
destructor.
> > > >
> > > >         (general_scalar_chain::general_scalar_chain): Delete
> > > >         constructor, now defined in the class declaration.
> > > >         (general_scalar_chain::~general_scalar_chain): Delete =
destructor.
> > > >
> > > >         (scalar_chain::mark_dual_mode_def): Renamed from
> > > >         general_scalar_chain::mark_dual_mode_def.
> > > >         (timode_scalar_chain::mark_dual_mode_def): Delete.
> > > >         (scalar_chain::convert_compare): Renamed from
> > > >         general_scalar_chain::convert_compare.
> > > >
> > > >         (timode_scalar_chain::compute_convert_gain): New method =
to
> > > >         determine the gain from converting a TImode chain to =
V1TImode.
> > > >         (timode_scalar_chain::convert_op): New method to convert =
an
> > > >         operand from TImode to V1TImode.
> > > >
> > > >         (timode_scalar_chain::convert_insn) <case REG>: Only =
PUT_MODE
> > > >         on REG_EQUAL notes that were originally TImode (not =
CONST_INT).
> > > >         Handle AND, ANDN, XOR, IOR, NOT and COMPARE.
> > > >         (timode_mem_p): Helper predicate to check where operand =
is
> > > >         memory reference with sufficient alignment for TImode =
STV.
> > > >         (timode_scalar_to_vector_candidate_p): Use
> convertible_comparison_p
> > > >         to check whether COMPARE is convertible.  Handle =
SET_DESTs that
> > > >         that are REG_P or MEM_P and SET_SRCs that are REG, =
CONST_INT,
> > > >         CONST_WIDE_INT, MEM, AND, ANDN, IOR, XOR or NOT.
> > > >
> > > > gcc/testsuite/ChangeLog
> > > >         * gcc.target/i386/sse4_1-stv-2.c: New test case, pand.
> > > >         * gcc.target/i386/sse4_1-stv-3.c: New test case, por.
> > > >         * gcc.target/i386/sse4_1-stv-4.c: New test case, pxor.
> > > >         * gcc.target/i386/sse4_1-stv-5.c: New test case, pandn.
> > > >         * gcc.target/i386/sse4_1-stv-6.c: New test case, ptest.
> > > >
> > > > Roger
> > > > --
> > > >
> >
>=20
>=20
> --
> H.J.