From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <roger@nextmovesoftware.com>
Received: from server.nextmovesoftware.com (server.nextmovesoftware.com
 [162.254.253.69])
 by sourceware.org (Postfix) with ESMTPS id C301B3857BB2
 for <gcc-patches@gcc.gnu.org>; Thu, 14 Jul 2022 05:31:47 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org C301B3857BB2
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none)
 header.from=nextmovesoftware.com
Authentication-Results: sourceware.org;
 spf=pass smtp.mailfrom=nextmovesoftware.com
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
 d=nextmovesoftware.com; s=default; h=Content-Transfer-Encoding:Content-Type:
 MIME-Version:Message-ID:Date:Subject:In-Reply-To:References:Cc:To:From:Sender
 :Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From:
 Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:
 List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
 bh=PMv6aqQDtR8kxBiN5GSmSINpEhYTYxr1lp/glukT2FI=; b=rdJygKO6TeByljy6IQr1Am1qdJ
 6qnJVsx8w37I+yqOH5hhL6QKLFqtrsrvHUQkqKmJ/vTAH50I91c9kboMvuDLEt3OeWw8AnYYGuIHq
 JGjWdOmR36HXJ9w3sRwDpWPJfDQC+7ZYlUhBpyhg4ZaOGqcBBrD7HxpvGQcd4Q0Khck5H8+RArzmN
 hR1nLGPFWRcdbXlP3KXcFjLRjx6NunW4RpJBAI1/73MVLajF/3FMfecH9/iNpDF4fZKQkoQDc+jWC
 EqcAiHQkTtieX/aD7wvh3eVRB+tZlL9jVubF4bHzO/RvlP9xjxfHfNp8Tz4AogwFuP3C93ZatkUE/
 Lc2UCvMA==;
Received: from host109-154-33-170.range109-154.btcentralplus.com
 ([109.154.33.170]:50439 helo=Dell)
 by server.nextmovesoftware.com with esmtpsa (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2)
 (envelope-from <roger@nextmovesoftware.com>)
 id 1oBrRy-0003Bg-SA; Thu, 14 Jul 2022 01:31:47 -0400
From: "Roger Sayle" <roger@nextmovesoftware.com>
To: "'H.J. Lu'" <hjl.tools@gmail.com>
Cc: "'Uros Bizjak'" <ubizjak@gmail.com>,
 "'GCC Patches'" <gcc-patches@gcc.gnu.org>
References: <000901d8938d$ead4dc40$c07e94c0$@nextmovesoftware.com>
 <CAFULd4YY=ZG089XjQjrxmDR2Hy_ZD0eoXsjfVFz0xe=JioxhcQ@mail.gmail.com>
 <00f201d8948b$ec82a6e0$c587f4a0$@nextmovesoftware.com>
 <CAMe9rOrYp5pmU26NpcbV-f16tpJ6CVvtc5Szvp2TSW3egu0CyA@mail.gmail.com>
 <014a01d894a5$71189220$5349b660$@nextmovesoftware.com>
 <CAMe9rOpds-iMyOLD9C5ZAzg2KQ=Mc8udAhYJRrY=4wAbYR=Kyw@mail.gmail.com>
In-Reply-To: <CAMe9rOpds-iMyOLD9C5ZAzg2KQ=Mc8udAhYJRrY=4wAbYR=Kyw@mail.gmail.com>
Subject: RE: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode
 to V1TImode.
Date: Thu, 14 Jul 2022 06:31:44 +0100
Message-ID: <000c01d89743$0278e4f0$076aaed0$@nextmovesoftware.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Mailer: Microsoft Outlook 16.0
Thread-Index: AQDLEU72DOSSolIGHYPmM3CDaDY5HQFm5hbmAaCooigCBX1m0gFZl94kAl2bbA6vUjanMA==
Content-Language: en-gb
X-AntiAbuse: This header was added to track abuse,
 please include it with any abuse report
X-AntiAbuse: Primary Hostname - server.nextmovesoftware.com
X-AntiAbuse: Original Domain - gcc.gnu.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - nextmovesoftware.com
X-Get-Message-Sender-Via: server.nextmovesoftware.com: authenticated_id:
 roger@nextmovesoftware.com
X-Authenticated-Sender: server.nextmovesoftware.com: roger@nextmovesoftware.com
X-Source: 
X-Source-Args: 
X-Source-Dir: 
X-Spam-Status: No, score=-3.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, KAM_SHORT, RCVD_IN_BARRACUDACENTRAL,
 SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Jul 2022 05:31:49 -0000


On Mon, Jul 11, 2022, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> > Hi HJ,
> >
> > I believe this should now be handled by the post-reload (CSE) pass.
> > Consider the simple test case:
> >
> > __int128 a, b, c;
> > void foo()
> > {
> >   a = 0;
> >   b = 0;
> >   c = 0;
> > }
> >
> > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC generates TImode writes:
> >         movq    $0, a(%rip)
> >         movq    $0, a+8(%rip)
> >         movq    $0, b(%rip)
> >         movq    $0, b+8(%rip)
> >         movq    $0, c(%rip)
> >         movq    $0, c+8(%rip)
> >         ret
> >
> > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         movaps  %xmm0, b(%rip)
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > You're quite right that internally STV actually generates the
> > equivalent of:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, b(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > Currently, because STV runs before cse2 and combine, the const0_rtx
> > gets CSE'd by the cse2 pass to produce the code we see.  However, if
> > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass),
> > you'll see we continue to generate the same optimized code, as the
> > same const0_rtx gets CSE'd in postreload.
> >
> > I can't be certain until I try the experiment, but I believe that the
> > postreload CSE will clean up all of the same common subexpressions.
> > Hence, it should be safe to perform all STV at the same point (after
> > combine), which allows for a few additional optimizations.
> >
> > Does this make sense?  Do you have a test case where
> > -fno-rerun-cse-after-loop produces different/inferior code for TImode
> > STV chains?
> >
> > My guess is that the RTL passes have changed so much in the last six
> > or seven years that some of the original motivation no longer applies.
> > Certainly we now try to keep TI mode operations visible longer, and
> > then allow STV to behave like a pre-reload pass to decide which set of
> > registers to use (vector V1TI or scalar doubleword DI).  Any CSE
> > opportunities that cse2 finds with V1TI mode could/should equally
> > well be found for TI mode (mostly).
>
> You are probably right.  If there are no regressions in the GCC testsuite,
> my original motivation is no longer valid.

It was good to try the experiment, but H.J. is right: there is still some
benefit (as well as some disadvantages) to running STV lowering before
CSE2/combine.  A clean-up patch to perform all STV conversion as a single
pass (removing a pass from the compiler) results in just a single regression
in the test suite:
FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8
which looks like:

__int128 a, b, c, d, e, f;
void foo (void)
{
  a = 0;
  b = -1;
  c = 0;
  d = -1;
  e = 0;
  f = -1;
}

When STV is performed after combine (without CSE), reload prefers to
implement this function using a single register, which then requires 12
instructions rather than the 8 needed when using two registers.  Alas,
there's nothing that postreload CSE/GCSE can do.  Doh!

        pxor    %xmm0, %xmm0
        movaps  %xmm0, a(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, b(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, c(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, d(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, e(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, f(%rip)
        ret
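
For comparison, here is a hand-written sketch of the two-register data flow
that the testsuite expectation implies: each constant is materialized once
and then stored three times, which should correspond to one pxor, one
pcmpeqd and six movaps.  The function name foo_two_regs and the v2di typedef
are my own, purely illustrative choices, and this uses GCC's generic vector
extensions rather than anything STV itself emits; the compiler remains free
to allocate registers differently.

```c
/* Illustrative only: 16-byte vector of two 64-bit lanes, marked may_alias
   so it can be stored through a pointer to an __int128 object.  */
typedef long long v2di __attribute__ ((vector_size (16), may_alias));

__int128 a, b, c, d, e, f;

/* Hypothetical name; mirrors foo() from the failing test case.  */
void foo_two_regs (void)
{
  const v2di zero = { 0, 0 };    /* one all-zeros vector, shared by a/c/e */
  const v2di ones = { -1, -1 };  /* one all-ones vector, shared by b/d/f */

  *(v2di *) &a = zero;
  *(v2di *) &b = ones;
  *(v2di *) &c = zero;
  *(v2di *) &d = ones;
  *(v2di *) &e = zero;
  *(v2di *) &f = ones;
}
```

Whether the two constants actually end up in separate xmm registers depends
on exactly the pass ordering discussed here, so treat this as a statement of
intent rather than a guaranteed code sequence.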

I also note that even without STV, the scalar implementation of this
function, when compiled with -Os, is larger than it needs to be due to poor
CSE (notice that in the following we only need a single zero register, and
an all_ones register would be helpful).

        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ecx, %ecx
        movq    $-1, b(%rip)
        movq    %rax, a(%rip)
        movq    %rax, a+8(%rip)
        movq    $-1, b+8(%rip)
        movq    %rdx, c(%rip)
        movq    %rdx, c+8(%rip)
        movq    $-1, d(%rip)
        movq    $-1, d+8(%rip)
        movq    %rcx, e(%rip)
        movq    %rcx, e+8(%rip)
        movq    $-1, f(%rip)
        movq    $-1, f+8(%rip)
        ret

I need to give the problem some more thought.  It would be good to
clean up/unify the STV passes, but I/we need to solve/CSE HJ's last test
case before we do.  Perhaps forbidding "(set (mem:ti) (const_int 0))" in
movti_internal would force the zero register to become visible and be
CSE'd, benefiting both vector code and scalar -Os code; postreload/peephole2
could then fix up the remaining scalar cases.  It's tricky.

Cheers,
Roger
--