RE: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: "Roger Sayle" <roger@nextmovesoftware.com>
To: "'H.J. Lu'" <hjl.tools@gmail.com>
Cc: "'Uros Bizjak'" <ubizjak@gmail.com>,
	"'GCC Patches'" <gcc-patches@gcc.gnu.org>
Subject: RE: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.
Date: Thu, 14 Jul 2022 06:31:44 +0100	[thread overview]
Message-ID: <000c01d89743$0278e4f0$076aaed0$@nextmovesoftware.com> (raw)
In-Reply-To: <CAMe9rOpds-iMyOLD9C5ZAzg2KQ=Mc8udAhYJRrY=4wAbYR=Kyw@mail.gmail.com>


On Mon, Jul 11, 2022, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> > Hi HJ,
> >
> > I believe this should now be handled by the post-reload (CSE) pass.
> > Consider the simple test case:
> >
> > __int128 a, b, c;
> > void foo()
> > {
> >   a = 0;
> >   b = 0;
> >   c = 0;
> > }
> >
> > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes:
> >         movq    $0, a(%rip)
> >         movq    $0, a+8(%rip)
> >         movq    $0, b(%rip)
> >         movq    $0, b+8(%rip)
> >         movq    $0, c(%rip)
> >         movq    $0, c+8(%rip)
> >         ret
> >
> > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         movaps  %xmm0, b(%rip)
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > You're quite right internally the STV actually generates the equivalent of:
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, a(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, b(%rip)
> >         pxor    %xmm0, %xmm0
> >         movaps  %xmm0, c(%rip)
> >         ret
> >
> > And currently because STV run before cse2 and combine, the const0_rtx
> > gets CSE'd be the cse2 pass to produce the code we see.  However, if
> > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass),
> > you'll see we continue to generate the same optimized code, as the
> > same const0_rtx gets CSE'd in postreload.
> >
> > I can't be certain until I try the experiment, but I believe that the
> > postreload CSE will clean-up, all of the same common subexpressions.
> > Hence, it should be safe to perform all STV at the same point (after
> > combine), which for a few additional optimizations.
> >
> > Does this make sense?  Do you have a test case,
> > -fno-rerun-cse-after-loop produces different/inferior code for TImode STV
> chains?
> >
> > My guess is that the RTL passes have changed so much in the last six
> > or seven years, that some of the original motivation no longer applies.
> > Certainly we now try to keep TI mode operations visible longer, and
> > then allow STV to behave like a pre-reload pass to decide which set of
> > registers to use (vector V1TI or scalar doubleword DI).  Any CSE
> > opportunities that cse2 finds with V1TI mode, could/should equally
> > well be found for TI mode (mostly).
> 
> You are probably right.  If there are no regressions in GCC testsuite, my original
> motivation is no longer valid.

It was good to try the experiment, but H.J. is right, there is still some benefit
(as well as some disadvantages)  to running STV lowering before CSE2/combine.
A clean-up patch to perform all STV conversion as a single pass (removing a
pass from the compiler) results in just a single regression in the test suite:
FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8
which looks like:

__int128 a, b, c, d, e, f;
void foo (void)
{
  a = 0;
  b = -1;
  c = 0;
  d = -1;
  e = 0;
  f = -1;
}

By performing STV after combine (without CSE), reload prefers to implement
this function using a single register, that then requires 12 instructions rather
than 8 (if using two registers).  Alas there's nothing that postreload CSE/GCSE
can do.  Doh!

        pxor    %xmm0, %xmm0
        movaps  %xmm0, a(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, b(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, c(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, d(%rip)
        pxor    %xmm0, %xmm0
        movaps  %xmm0, e(%rip)
        pcmpeqd %xmm0, %xmm0
        movaps  %xmm0, f(%rip)
        ret

I also note that even without STV, the scalar implementation of this function when
compiled with -Os is also larger than it needs to be due to poor CSE (notice in the
following we only need a single zero register, and  an all_ones reg would be helpful).

        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ecx, %ecx
        movq    $-1, b(%rip)
        movq    %rax, a(%rip)
        movq    %rax, a+8(%rip)
        movq    $-1, b+8(%rip)
        movq    %rdx, c(%rip)
        movq    %rdx, c+8(%rip)
        movq    $-1, d(%rip)
        movq    $-1, d+8(%rip)
        movq    %rcx, e(%rip)
        movq    %rcx, e+8(%rip)
        movq    $-1, f(%rip)
        movq    $-1, f+8(%rip)
        ret

I need to give the problem some more thought.  It would be good to clean-up/unify
the STV passes, but I/we need to solve/CSE HJ's last test case before we do.  Perhaps
by forbidding "(set (mem:ti) (const_int 0))" in movti_internal, would force the zero
register to become visible, and CSE'd, benefiting both vector code and scalar -Os code,
then use postreload/peephole2 to fix up the remaining scalar cases.  It's tricky.

Cheers,
Roger
--

next prev parent reply	other threads:[~2022-07-14  5:31 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-09 12:17 Roger Sayle
2022-07-10 18:05 ` Uros Bizjak
2022-07-10 18:36   ` Roger Sayle
2022-07-10 19:15     ` H.J. Lu
2022-07-10 21:38       ` Roger Sayle
2022-07-10 23:56         ` H.J. Lu
2022-07-14  5:31           ` Roger Sayle [this message]
2022-07-14  7:10             ` Richard Biener
2022-07-11  6:41     ` Uros Bizjak

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='000c01d89743$0278e4f0$076aaed0$@nextmovesoftware.com' \
    --to=roger@nextmovesoftware.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=hjl.tools@gmail.com \
    --cc=ubizjak@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).