Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Richard Biener <richard.guenther@gmail.com>
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: "H.J. Lu" <hjl.tools@gmail.com>, GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.
Date: Thu, 14 Jul 2022 09:10:28 +0200	[thread overview]
Message-ID: <CAFiYyc3gvQr0rUY0s9-Hq6A6QuoXfJKK1iDdGHHW5i+eKssp-g@mail.gmail.com> (raw)
In-Reply-To: <000c01d89743$0278e4f0$076aaed0$@nextmovesoftware.com>

On Thu, Jul 14, 2022 at 7:32 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> On Mon, Jul 11, 2022, H.J. Lu <hjl.tools@gmail.com> wrote:
> > On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > > Hi HJ,
> > >
> > > I believe this should now be handled by the post-reload (CSE) pass.
> > > Consider the simple test case:
> > >
> > > __int128 a, b, c;
> > > void foo()
> > > {
> > >   a = 0;
> > >   b = 0;
> > >   c = 0;
> > > }
> > >
> > > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes:
> > >         movq    $0, a(%rip)
> > >         movq    $0, a+8(%rip)
> > >         movq    $0, b(%rip)
> > >         movq    $0, b+8(%rip)
> > >         movq    $0, c(%rip)
> > >         movq    $0, c+8(%rip)
> > >         ret
> > >
> > > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, a(%rip)
> > >         movaps  %xmm0, b(%rip)
> > >         movaps  %xmm0, c(%rip)
> > >         ret
> > >
> > > You're quite right internally the STV actually generates the equivalent of:
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, a(%rip)
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, b(%rip)
> > >         pxor    %xmm0, %xmm0
> > >         movaps  %xmm0, c(%rip)
> > >         ret
> > >
> > > And currently because STV run before cse2 and combine, the const0_rtx
> > > gets CSE'd be the cse2 pass to produce the code we see.  However, if
> > > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass),
> > > you'll see we continue to generate the same optimized code, as the
> > > same const0_rtx gets CSE'd in postreload.
> > >
> > > I can't be certain until I try the experiment, but I believe that the
> > > postreload CSE will clean-up, all of the same common subexpressions.
> > > Hence, it should be safe to perform all STV at the same point (after
> > > combine), which for a few additional optimizations.
> > >
> > > Does this make sense?  Do you have a test case,
> > > -fno-rerun-cse-after-loop produces different/inferior code for TImode STV
> > chains?
> > >
> > > My guess is that the RTL passes have changed so much in the last six
> > > or seven years, that some of the original motivation no longer applies.
> > > Certainly we now try to keep TI mode operations visible longer, and
> > > then allow STV to behave like a pre-reload pass to decide which set of
> > > registers to use (vector V1TI or scalar doubleword DI).  Any CSE
> > > opportunities that cse2 finds with V1TI mode, could/should equally
> > > well be found for TI mode (mostly).
> >
> > You are probably right.  If there are no regressions in GCC testsuite, my original
> > motivation is no longer valid.
>
> It was good to try the experiment, but H.J. is right, there is still some benefit
> (as well as some disadvantages)  to running STV lowering before CSE2/combine.
> A clean-up patch to perform all STV conversion as a single pass (removing a
> pass from the compiler) results in just a single regression in the test suite:
> FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8
> which looks like:
>
> __int128 a, b, c, d, e, f;
> void foo (void)
> {
>   a = 0;
>   b = -1;
>   c = 0;
>   d = -1;
>   e = 0;
>   f = -1;
> }
>
> By performing STV after combine (without CSE), reload prefers to implement
> this function using a single register, that then requires 12 instructions rather
> than 8 (if using two registers).  Alas there's nothing that postreload CSE/GCSE
> can do.  Doh!

Hmm, the RA could be taught to make use of more of the register file I suppose
(shouldn't regrename do this job - but it runs after postreload-cse)

>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, a(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, b(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, c(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, d(%rip)
>         pxor    %xmm0, %xmm0
>         movaps  %xmm0, e(%rip)
>         pcmpeqd %xmm0, %xmm0
>         movaps  %xmm0, f(%rip)
>         ret
>
> I also note that even without STV, the scalar implementation of this function when
> compiled with -Os is also larger than it needs to be due to poor CSE (notice in the
> following we only need a single zero register, and  an all_ones reg would be helpful).
>
>         xorl    %eax, %eax
>         xorl    %edx, %edx
>         xorl    %ecx, %ecx
>         movq    $-1, b(%rip)
>         movq    %rax, a(%rip)
>         movq    %rax, a+8(%rip)
>         movq    $-1, b+8(%rip)
>         movq    %rdx, c(%rip)
>         movq    %rdx, c+8(%rip)
>         movq    $-1, d(%rip)
>         movq    $-1, d+8(%rip)
>         movq    %rcx, e(%rip)
>         movq    %rcx, e+8(%rip)
>         movq    $-1, f(%rip)
>         movq    $-1, f+8(%rip)
>         ret
>
> I need to give the problem some more thought.  It would be good to clean-up/unify
> the STV passes, but I/we need to solve/CSE HJ's last test case before we do.  Perhaps
> by forbidding "(set (mem:ti) (const_int 0))" in movti_internal, would force the zero
> register to become visible, and CSE'd, benefiting both vector code and scalar -Os code,
> then use postreload/peephole2 to fix up the remaining scalar cases.  It's tricky.

Not sure if related but ppc(?) folks recently tried to massage CSE to
avoid propagating
constants by making sure that rtx_cost handles (set (...) (const_int
...)) "properly".
But IIRC CSE never does the reverse transform - split out a constant
to a pseudo from
multiple uses of the same constant - that's probably on the job of
reload + postreload-CSE
right now, but reload probably does not know that there are multiple
uses of the constant
so the splitting is worthwhile.

> Cheers,
> Roger
> --
>
>

next prev parent reply	other threads:[~2022-07-14  7:10 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-09 12:17 Roger Sayle
2022-07-10 18:05 ` Uros Bizjak
2022-07-10 18:36   ` Roger Sayle
2022-07-10 19:15     ` H.J. Lu
2022-07-10 21:38       ` Roger Sayle
2022-07-10 23:56         ` H.J. Lu
2022-07-14  5:31           ` Roger Sayle
2022-07-14  7:10             ` Richard Biener [this message]
2022-07-11  6:41     ` Uros Bizjak

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAFiYyc3gvQr0rUY0s9-Hq6A6QuoXfJKK1iDdGHHW5i+eKssp-g@mail.gmail.com \
    --to=richard.guenther@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=hjl.tools@gmail.com \
    --cc=roger@nextmovesoftware.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).